Abstract

Variable-length splittable codes are derived from encoding sequences of ordered integer pairs, where one of the
pair's components is upper bounded by some constant, and the other one is any positive integer. Each pair is encoded
by the concatenation of two fixed independent prefix encoding functions applied to the corresponding components
of the pair. The codeword of such a sequence of pairs is the sequential concatenation of the encodings of its
pairs. We call such codes splittable. We show that Fibonacci codes of higher orders and codes with multiple
delimiters of the form 01...10 are splittable. Completeness and universality of multi-delimiter codes are proved.
Encoding of integers by multi-delimiter codes is considered in detail. For these codes, a fast byte aligned decoding
algorithm is constructed. The comparative compression performance of Fibonacci codes and different multi-delimiter
codes is presented. In many useful properties, multi-delimiter codes are superior to Fibonacci codes.

Index Terms 

Prefix code, Fibonacci code, data compression, robustness, completeness, universality, density, multi-delimiter 

I. Introduction 

The current stage of information infrastructure development is marked by the active interaction of
various computer applications with huge Information Retrieval Systems. This raises the demand for
efficient data compression methods that, on the one hand, provide a satisfactory compression rate and, on the other,
support fast search operations in compressed data. Along with this, the need for code robustness, in the sense of
limiting possible error propagation, has also grown.

As is known, in large textual databases classical Huffman codes [1], when applied to words considered as
symbols, show good compression efficiency approaching the theoretical optimum. Unfortunately, Huffman encoding
does not allow a fast direct search in compressed data for a given compressed pattern. At the expense of losing
some compression efficiency, this was amended by introducing byte aligned tagged Huffman codes. These are
Tagged Huffman Codes [2], End-Tagged Dense Codes (ETDC) [3], and (s,c)-Dense Codes (SCDC) [4]. In these
constructions, codewords are represented as sequences of bytes, which along with the encoded information incorporate
flags for the end of a codeword.

The alternative approach to compression coding stems from using Fibonacci numbers of higher orders. The
mathematical study of Fibonacci codes was started in the pioneering paper [5]. The authors first introduced a
family of Fibonacci codes of higher orders with the emphasis on their robustness. They proved completeness and
universality of these codes.

The strongest argumentation for the use of Fibonacci codes of higher orders in data compression is given in [6],
[7]. For these codes, the authors developed fast byte aligned algorithms for decoding [8] and search in compressed
text [9]. They also showed that Fibonacci codes have better compression efficiency compared with ETDC and
SCDC while still being somewhat inferior in decompression and search speed even if byte aligned algorithms are
applied.

Evidently, the structure of a code strongly depends on the form of the initial data representation. Note that in their
constructions many integer encodings use two-part information. For instance, the simplest Run-Length Codes use
pairs (the count of a symbol in a run, symbol). The famous Elias [10], Levenshtein [11] and many other codes that
use their own length [12] exploit the paired integer information (bit length, binary representation). The Golomb
[13] and the Golomb-Rice [14] codes use pairs (quotient, remainder) under integer division by a fixed number.

So, we argue that many code constructions fit into the general scheme as follows: 

(i) According to some mathematical principle, each element of the input alphabet is put into one-to-one correspondence 
with the sequence of ordered integer pairs. Some relationships inside pairs and among pairs could be specified. 

(ii) For encoding pairs, some variable-length uniquely decodable function is chosen.

(iii) To obtain the resultant codeword of a sequence of pairs, the corresponding codewords of pairs are concatenated 
in direct or reverse order. 

(iv) A special delimiter could be appended to the obtained binary sequence. 

This general scheme could be specified in many ways. One such variant, with the emphasis on splitting a
code into simpler basic components, is considered in this presentation.

We introduce and study a family of binary codes that are derived from encoding sequences of ordered integer
pairs with restrictions on one of the pair's components. Namely, we consider the initial data representation of the
form $(A_1, k_1) \ldots (A_t, k_t)$, where all integers $A_i$ are upper bounded by some constant d, values $k_i$ are
not bounded, $0 \le A_i \le d$, $0 < k_i$, $i = 1, \ldots, t$. Each pair is encoded using the concatenation of two
fixed independent prefix encoding functions applied to the corresponding components of a pair. A codeword consists
of the sequential concatenation of those pairs' encodings. We call such codes splittable. Depending on tasks to be
solved, one can choose a variety of coding functions to encode each pair (A, k). This way we construct a code, which
we call a (A, k)-code.

In the same way, by using the dual representation $(k_1, A_1), \ldots, (k_t, A_t)$, we define (k, A)-codes.
The families of (A, k) and (k, A)-codes constitute the set of splittable codes. By giving such a name to the considered
codes we want to stress that the structure of a code reflects the splittable nature of the initial data representation by
simpler integral parts. Splittable codes could be considered as a generalization of Golomb's codes, which contain
only one (k, A)-pair.

Splittable codes are well structured. Each codeword, including delimiters, is the concatenation of an integral 
number of corresponding (A, k) or (k, A)-pairing encodings. This regularity of a code structure also facilitates 
proving its important properties, such as completeness, universality, and density. 

In spite of the fact that (A, k) and (k, A)-sequences carry the same information about coded data, their encodings
could be very different. We prove that any Fibonacci code belongs to the class of (k, A)-codes and cannot be any
(A, k)-code.

An important family of (A, k)-codes are variable length codes with multiple delimiters. These codes are the main
subject of our study.

A delimiter is a synchronizing string that makes it possible to uniquely identify boundaries of codewords under
their concatenation. In our case, each delimiter consists of a run of consecutive ones surrounded with zero brackets.
Thus, delimiters have the form 01...10. A delimiter either can be a proper suffix of a codeword, or it arises as the
concatenation of the final zero of the preceding codeword and a codeword of the form 11...10. The number of ones in
delimiters is defined by a given fixed set of positive integers $m_i$, $i = 1, 2, \ldots, t$. The multi-delimiter code of
that form is denoted by $D_{m_1,\ldots,m_t}$. We prove that any multi-delimiter code is a (A, k)-code and thus splittable.

By their properties, multi-delimiter codes are close to Fibonacci codes of higher orders. We prove completeness
and universality of those codes. There also exists a bijection between the set of natural numbers and any code
$D_{m_1,\ldots,m_t}$. This bijection is implemented by simple encoding and decoding procedures. For practical use, we
present a byte aligned decoding algorithm, which has better computational characteristics than that of Fibonacci
codes developed in [7].

As shown in [7], the Fibonacci code of order three, denoted by Fib3, is the most effective for text compression. From
our study it follows that the simple code $D_2$ with one delimiter 0110 has asymptotically higher density than
Fib3, although it is slightly inferior in compression rate for realistic alphabet sizes of natural language texts.

We also note that by varying delimiters for better compression we can adapt multi-delimiter codes to a given
probability distribution and an alphabet size. Thus, for example, we compare the codes $D_{2,3}$, $D_{2,3,5}$ and
$D_{2,4,5}$ with the code Fib3. Those multi-delimiter codes are asymptotically less dense than Fib3. Nevertheless,
alphabet sizes of the texts used in practice are relatively small, from a few thousand up to a few million words. For
texts of such sizes the multi-delimiter codes mentioned above outperform the Fib3 code in compression rate. The
conducted computational experiment shows that, for example, the code $D_{2,3,5}$ gives an average codeword length
2-3% shorter than the Fib3 code when encoding the Bible and some other known texts. Even in encoding one of the
largest natural language text corpora to date, the English Wikipedia, the code $D_{2,3,5}$ is still superior, as are the
codes $D_{2,3}$ and $D_{2,4,5}$.

Multi-delimiter codes, like Fibonacci codes, are static codeword sets not depending on any probability distribution.
For a multi-delimiter code there exists an easy procedure for generating all words of a given length. Therefore,
these codes allow an easy vocabulary representation for compression and decompression procedures. To create the
vocabulary, one only needs to sort symbols according to the probabilities of their occurrences.

Due to robust delimiters, multi-delimiter codes are synchronizable with synchronization delay at most one 
codeword. 

Properties of multi-delimiter codes mainly rely on a finite set of special suffixes. Sets of words with a given fixed
suffix, which cannot occur in other places of a word, are known as pattern codes. Properties of these codes such as
synchronizability, completeness, universality, and the average codeword length have been intensively studied [15]-[20].
Multi-delimiter codes, even with one delimiter, are not pattern codes, although they belong to the class of universal
codes that are regular languages [19].

The structure of this presentation is as follows. Before introducing splittable codes, we consider two simpler
codes of that type. In Section 3, with the purpose of showing how (A, k)-constructions
arise in integer encodings, we briefly consider a specific integer representation using the two-base numeration system
with the main radix 2 and the auxiliary radix 3. This representation yields a typical (A, k)-code with restrictions
given by the inequalities $0 \le A \le 2$, $0 < k$. This code is universal, but it is not complete. In section 4 we show
that it can be embedded into the larger one-delimiter code set $D_2$, which is complete.

In section 5 we introduce splittable codes, and discuss (A, k) versus (k, A)-codes. We argue that (A, k)-codes
have some advantages compared with (k, A)-codes. These include the possibility to form a wider variety of short
codewords and more efficient codeword separation.

In section 6 we introduce multi-delimiter codes $D_{m_1,\ldots,m_t}$. We prove the main properties of
these codes mentioned above: being a (A, k)-code, completeness, and universality.

A bijective correspondence between the set of natural numbers and the codewords of any code $D_{m_1,\ldots,m_t}$ is
established in the next section. For multi-delimiter codes we present simple algorithms for encoding integers and
decoding codewords. To accelerate decoding, we describe the general scheme of
a byte aligned algorithm. Using the code $D_2$ as the representative of the considered family of codes, a byte aligned
decoding algorithm is presented in detail in Section 8.

Comparative density characteristics of different multi-delimiter codes and the code Fib3 are given in Section 9. 

Our conclusion is the following. The introduced multi-delimiter codes form a rich adaptive family of robust data 
compression codes that could be useful in many practical applications. 

II. Definitions and notations 

By {0,1}* denote the set of all strings in the alphabet {0,1}. Let m be a non-negative integer. Denote by $1^m$
(respectively, $0^m$) the sequence consisting of m consecutive ones (respectively, m zeros).

The empty string corresponds to the value m = 0. 

A run of consecutive ones in a word w is called isolated if it is a prefix of this word ending with zero, or it is 
its suffix starting with zero, or it is a substring of w surrounded with zeros, or it coincides with w. 


DRAFT 


August 7, 2015 





For a word $w \in \{0,1\}^*$ its length is denoted by $|w|$.

A code is a set of binary words. 

A code is called prefix (prefix-free) if no codeword could be a prefix of another codeword. 

A code is called uniquely decodable (UD) if every concatenation of codewords can be parsed into codewords in
exactly one way. Each prefix code has the UD property.

A code is called complete if adding any new word to it yields a code that is not UD.

Let $(A_0, k_0) \ldots (A_t, k_t)$ be a sequence of ordered integer pairs, where $0 \le A_i \le d$, $0 < k_i$. For
simplicity, in the sequel, pairs $(A_i, k_i)$ of that type are called (A, k)-pairs, and a sequence of such pairs is called
a (A, k)-sequence. Symbols A and k can be viewed as names of variables corresponding to values $A_i$ and $k_i$.

We encode values A and k by some fixed prefix binary codes. The codeword of a (A, k)-pair is the concatenation
of codewords corresponding to parameters A and k. The codeword of a (A, k)-pair is called the (A, k)-group.

In an analogous way, by changing the order in pairs, we define (k, A)-pairs, (k, A)-sequences, and (k, A)-groups.

Fibonacci numbers of order m > 1, denoted by $F_n^{(m)}$, are defined by the recurrence relation:

$F_n^{(m)} = F_{n-1}^{(m)} + F_{n-2}^{(m)} + \cdots + F_{n-m}^{(m)}$ for $n > 0$, where $F_0^{(m)} = 1$ and $F_n^{(m)} = 0$ for $-m < n < 0$.

The Fibonacci code of order m, denoted by $\mathrm{Fib}_m$, is the set consisting of the word $1^m$ and all other binary
words that contain exactly one occurrence of the substring $1^m$, and this occurrence is the word's suffix [7].

For any $n \ge 0$ the Fibonacci code $\mathrm{Fib}_m$ contains exactly $F_n^{(m)}$ codewords of the length n + m.
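
The counting statement can be checked numerically. The following is a minimal sketch, not part of the original construction: it assumes the boundary conditions $F_0^{(m)} = 1$ and $F_n^{(m)} = 0$ for $-m < n < 0$ (an indexing chosen here so that the counts line up with codeword lengths n + m), and the function names are hypothetical.

```python
from itertools import product


def fib_numbers(m, count):
    """F_0, ..., F_{count-1} of order m, assuming F_0 = 1 and F_n = 0 for -m < n < 0."""
    window = [0] * (m - 1) + [1]          # F_{-(m-1)}, ..., F_{-1}, F_0
    out = []
    for _ in range(count):
        out.append(window[-1])
        window = window[1:] + [sum(window)]   # next term is the sum of the last m terms
    return out


def fib_code_words(m, length):
    """All Fib_m codewords of a given length, found by brute force."""
    suffix = "1" * m
    words = []
    for bits in product("01", repeat=length):
        w = "".join(bits)
        # 1^m must be a suffix and must not occur anywhere earlier in the word
        if w.endswith(suffix) and suffix not in w[:-1]:
            words.append(w)
    return words


def counts_by_length(m, count):
    """Number of Fib_m codewords of length n + m for n = 0, ..., count - 1."""
    return [len(fib_code_words(m, n + m)) for n in range(count)]
```

For small m, comparing `fib_numbers(m, c)` with `counts_by_length(m, c)` gives a quick sanity check of the statement; the brute-force enumeration is exponential and is meant only for short lengths.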

III. Lower (2,3)-representation of numbers

Representation of numbers in the mixed two-base numeration system using the main radix 2 and the auxiliary 
radix 3 was first introduced in [21]. Prefix encoding of integers using this representation was studied in [22]. The 
so-called lower (2,3)-representation of numbers, which is a modification of the general (2,3)-representation, was 
introduced in [23]. Let us briefly describe its essence. 

Let $N_{2,3}$ be the set of natural numbers that are coprime with 2 and 3, $x \in N_{2,3}$, $x > 1$,
$n = \lfloor \log_2 x \rfloor$, $1 \le m \le n$.

A very simple idea stands behind the (2,3)-integer representation. Note that for any positive integer m the
integers $2^m$ and $2^{m-1}$ give different residues modulo 3. Therefore, x can be uniquely represented in one of the
forms $2^m + 3^k x_1$ or $2^{m-1} + 3^k x_1$, where $x_1$ also belongs to $N_{2,3}$ and $k \ge 1$.

In the general (2,3)-representation of x the maximal value is chosen for m, $m = \lfloor \log_2 x \rfloor$. In the lower
(2,3)-representation we use the shifted value, $m = \lfloor \log_2 x \rfloor - 1$. Such a choice for m provides a more
balanced form of the (2,3)-integer partition. Thus, any number x belonging to the set $N_{2,3}$ can be uniquely
represented in one of the forms $2^{n-1} + 3^k x_1$ or $2^{n-2} + 3^k x_1$, where $x_1 \in N_{2,3}$, $x_1 < x$, $k \ge 1$.
Applying the same decomposition procedure to $x_1$, we obtain the remaining number $x_2$. In general, at the i-th stage
of the iterative procedure, we get the remaining number $x_{i+1}$, such that $x_i = 2^{n_i} + 3^{k_i} x_{i+1}$, where
$n_i = \lfloor \log_2 x_i \rfloor - 1$ or $n_i = \lfloor \log_2 x_i \rfloor - 2$. We continue this process recursively until
at a certain iteration $t - 1$ we obtain $x_t = 1$ or $x_t = 2$ (in the last case $x_{t-1} = 7 = 2^0 + 3 \cdot 2$).

A lower (2,3)-code is defined as any code in the binary alphabet {0,1} that can be used to restore the sequence
of values $x_t, \ldots, x_1, x$. One such code is obtained using the so-called (A, k)-approach.

Note that for the unambiguous reconstruction of the number x it is sufficient to keep the sequence of pairs given
by the values $A_i = \lfloor \log_2 (3^{k_i} x_{i+1}) \rfloor - n_i$ and $k_i$, $i = 0, \ldots, t - 1$. These pairs are
obtained at each iteration during the decomposition of x. For the lower (2,3)-representation the following remarkable
property holds. The parameter $A_i$ defined above can take only three values: 0, 1 and 2 [23].

So, with a number x the numerical sequence of pairs $(A_0, k_0), (A_1, k_1), \ldots, (A_{t-1}, k_{t-1})$ is uniquely
associated, where $0 \le A_i \le 2$, $0 < k_i$.
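
The decomposition can be sketched in a few lines. The code below is an illustration under the reading given above (at each step the exponent is $\lfloor \log_2 x_i \rfloor - 1$ or $\lfloor \log_2 x_i \rfloor - 2$, whichever leaves a remainder divisible by 3); the function name `lower_23_pairs` is hypothetical and not taken from [23].

```python
def lower_23_pairs(x):
    """Return [(A_0, k_0), ..., (A_{t-1}, k_{t-1})] for x > 1 coprime with 2 and 3."""
    if x < 5 or x % 2 == 0 or x % 3 == 0:
        raise ValueError("x must be greater than 1 and coprime with 2 and 3")
    pairs = []
    while x > 2:                                  # stop at x_t = 1 or x_t = 2
        top = x.bit_length() - 1                  # floor(log2 x)
        # exactly one of the two candidate exponents leaves a multiple of 3
        n_i = top - 1 if (x - 2 ** (top - 1)) % 3 == 0 else top - 2
        y = x - 2 ** n_i                          # y = 3^k * x_{i+1}
        rest, k = y, 0
        while rest % 3 == 0:
            rest, k = rest // 3, k + 1
        a = (y.bit_length() - 1) - n_i            # A_i = floor(log2 y) - n_i
        pairs.append((a, k))
        x = rest
    return pairs
```

For example, `lower_23_pairs(19)` follows the chain 19 = 2^2 + 3 * 5 and 5 = 2^1 + 3 * 1, so it yields two pairs, the first for x = 19 and the second for $x_1 = 5$.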

For the lower (2,3)-encoding, we use a specific binary encoding of pairs. The value A is encoded as follows:
A = 2 by the symbol 0, A = 1 by the word 11 and A = 0 by the word 10. The value k is encoded by the word
$1^{k-1}0$ with some exceptions arising due to the selection of a delimiter. In these exceptional cases, the codeword
for k is $1^k0$.

The codeword of a number x is the sequential concatenation of the corresponding (A, k)-groups. For the lower
(2,3)-code the groups are written in the reverse of the order in which they are obtained during encoding,
$(A_{t-1}, k_{t-1}), \ldots, (A_0, k_0)$. This allows decoding to proceed from left to right and makes it easier.

Since every (A, k)-group, and hence each codeword, ends with the symbol 0, the word 0110 can serve as a delimiter.

To form the delimiter, it is necessary to append the string 110 to the end of some words. If in a codeword the
last group corresponding to the pair $(A_0, k_0)$ takes the form 0110 or 10110, i.e. $k_0 = 3$ and $A_0 \ne 1$, then it
already contains the delimiter, so there is no need to append the string 110 to the end of the word.

Thus, the (A, k)-groups 110, 0110, 10110 are separating ones; if any of them occurs, a codeword ends with it.
In a codeword the last group 110, which is externally appended, does not correspond to any pair (A, k) that takes
part in the lower (2,3)-representation, and has to be ignored during decoding, but groups 0110 and 10110 have to
be taken into consideration. So, no (A, k)-group that corresponds to a pair may take the form 110, and
no (A, k)-group except the last one may take the forms 0110 or 10110. However, codewords of pairs
$(A_i, k_i)$ obtained in the lower (2,3)-factorization can violate these conditions. Namely, this undesirable situation
occurs when:

1) A = 1 and k = 1 (then the group 110 is formed);

2) $A \ne 1$, k = 3 and the corresponding (A, k)-group is not the last one (it is one of the groups 0110 or 10110).

It is easy to check (and this is shown in [23]) that for the group $(A_{t-1}, k_{t-1})$, which is written first in a
codeword, case 1) is impossible. Therefore, to avoid the undesirable situation mentioned above, instead of $1^{k-1}0$
we encode the value k in a (A, k)-group by the string $1^k0$ in the following cases:

A = 1 and the (A, k)-group is not the first;

$A \ne 1$, $k \ge 3$ and the (A, k)-group is not the last.

In this way, the constructed prefix code corresponds to the set of positive integers that are coprime with 2 and 3.
The number 1, for which the lower (2,3)-factorization is empty, corresponds to the shortest codeword 110. Together
with the last zero of a preceding codeword this sequence forms a delimiter.

TABLE I

Lower (2,3)-representations and codewords of the first fifteen numbers

 n    x   (A0,k0)   x1   (A1,k1)   x2   code
 1    1      -       -      -       -   110
 2    5     0,1      1      -       -   100 110
 3    7     2,1      2      -       -   00 110
 4   11     2,2      1      -       -   010 110
 5   13     1,2      1      -       -   1110 110
 6   17     0,2      1      -       -   1010 110
 7   19     1,1      5     0,1      1   100 1110 110
 8   23     0,1      5     0,1      1   100 100 110
 9   25     2,1      7     2,1      2   00 00 110
10   29     1,1      7     2,1      2   00 1110 110
11   31     2,3      1      -       -   0110
12   35     1,3      1      -       -   11110 110
13   37     0,1      7     2,1      2   00 100 110
14   41     2,1     11     2,2      1   010 00 110
15   43     0,3      1      -       -   10110


By $C_{2,3}^{low}$ we denote the lower (2,3)-code described above.

To encode an arbitrary positive integer n, it is necessary to find the n-th number in the ascending series of
numbers that are coprime with 2 and 3. This number equals $x = 3n - (n \bmod 2) - 1$. Thus, to encode n, one
has to find the lower (2,3)-representation of x and encode it.

Table I shows the 15 smallest numbers, their lower (2,3)-representations, and the corresponding codewords of the
lower (2,3)-code.
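
The encoding rules of this section can be collected into a short sketch. It is an illustration assembled from the rules stated above (codewords 0, 11, 10 for A = 2, 1, 0; $1^{k-1}0$ for k with the two exceptional cases using $1^k0$; groups written in reverse order; 110 appended unless the last group already carries the delimiter). All function names are hypothetical.

```python
A_CODE = {2: "0", 1: "11", 0: "10"}          # codewords for A, as given in the text


def lower_23_pairs(x):
    """Pairs (A_0, k_0), ..., (A_{t-1}, k_{t-1}); x is assumed coprime with 2 and 3."""
    pairs = []
    while x > 2:
        top = x.bit_length() - 1             # floor(log2 x)
        n_i = top - 1 if (x - 2 ** (top - 1)) % 3 == 0 else top - 2
        y = x - 2 ** n_i
        rest, k = y, 0
        while rest % 3 == 0:
            rest, k = rest // 3, k + 1
        pairs.append(((y.bit_length() - 1) - n_i, k))
        x = rest
    return pairs


def encode_x(x):
    """Codeword of a number x coprime with 2 and 3."""
    if x == 1:
        return "110"                          # empty factorization
    pairs = lower_23_pairs(x)
    t = len(pairs)
    groups = []
    for i in range(t - 1, -1, -1):            # groups are written in reverse order
        a, k = pairs[i]
        first, last = (i == t - 1), (i == 0)
        # exceptional cases: k is written as 1^k 0 instead of 1^(k-1) 0
        shifted = (a == 1 and not first) or (a != 1 and k >= 3 and not last)
        groups.append(A_CODE[a] + "1" * (k if shifted else k - 1) + "0")
    a0, k0 = pairs[0]
    if not (k0 == 3 and a0 != 1):             # last group is not 0110 or 10110
        groups.append("110")
    return "".join(groups)


def encode_n(n):
    """Codeword of the positive integer n via x = 3n - (n mod 2) - 1."""
    return encode_x(3 * n - (n % 2) - 1)
```

Calling `encode_n(n)` for n = 1, ..., 15 gives the codewords to compare against the last column of Table I, with the spaces between groups removed.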

As mentioned above, the last element in the lower (2,3)-representation is the number $x_t = 1$ or $x_t = 2$.
Hence, decoding starts from one of these numbers. Then the sequence of numbers $x_t, \ldots, x_1, x_0 = x$ is calculated
as follows. Using the values $A_i$ and $k_i$ we calculate $n_i = \lfloor \log_2 (3^{k_i} x_{i+1}) \rfloor - A_i$, and hence we
can obtain $x_i = 2^{n_i} + 3^{k_i} x_{i+1}$. Note that $x_t = 2$ if and only if $A_{t-1} = 2$ and $k_{t-1} = 1$; in other
cases $x_t = 1$ [23]. Thus, there is no ambiguity at the starting point of the decoding procedure.
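
The reconstruction step can be sketched as follows. The code assumes the pairs have already been extracted from the codeword and are listed as $(A_0, k_0), \ldots, (A_{t-1}, k_{t-1})$; parsing the bit string into groups is not shown, and the function name is hypothetical.

```python
def restore_x(pairs):
    """Rebuild x from [(A_0, k_0), ..., (A_{t-1}, k_{t-1})]."""
    if not pairs:
        return 1                              # empty factorization encodes x = 1
    # starting point: x_t = 2 exactly when the pair (A_{t-1}, k_{t-1}) is (2, 1)
    x = 2 if pairs[-1] == (2, 1) else 1
    for a, k in reversed(pairs):
        y = 3 ** k * x                        # 3^{k_i} * x_{i+1}
        n_i = (y.bit_length() - 1) - a        # n_i = floor(log2 y) - A_i
        x = 2 ** n_i + y                      # x_i = 2^{n_i} + 3^{k_i} * x_{i+1}
    return x
```

For instance, the pairs (1,1), (0,1) listed for x = 19 in Table I are processed from the right: first 5 = 2^1 + 3 * 1, then 19 = 2^2 + 3 * 5.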

IV. Code $D_2$

The existence of a delimiter for the code $C_{2,3}^{low}$ means that this code is prefix-free. However, it is not complete,
i.e. the set of its codewords can be expanded without losing the UD property. To demonstrate that, we
construct a prefix code that contains all the codewords from $C_{2,3}^{low}$, and some more.

This code is quite simple to define. It consists of the word 110, and all other binary words that do not start with
the string 110, end with the sequence 0110 and do not contain this sequence as a substring in other places. We
denote this code by $D_2$. The number 2 in the code notation indicates that its delimiter contains 2 consecutive ones.
Obviously, the code $D_2$ contains all the codewords of the code $C_{2,3}^{low}$ and has the same delimiter 0110 as the
code $C_{2,3}^{low}$.

Each portion of concatenated codewords from $D_2$ ends with the delimiter string 0110, which makes it possible to
unambiguously determine the beginning of a new codeword in the flow of codewords.

This also provides synchronizability of the code. If errors occur, a receiver only has to identify the first
delimiter string 0110 to resume parsing the code. However, in some cases it cannot unambiguously tell whether the
suffix 110 of that delimiter is a separate codeword.
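
The definition of $D_2$ translates directly into a membership test and a left-to-right parser for error-free streams. This is a minimal sketch based only on the definition above (it is not the byte aligned algorithm of Section 8), and the function names are hypothetical.

```python
def in_d2(word):
    """True if word belongs to D_2 as defined above."""
    if word == "110":
        return True
    return (
        set(word) <= {"0", "1"}
        and not word.startswith("110")
        and word.endswith("0110")
        and "0110" not in word[:-1]           # the delimiter occurs only as the suffix
    )


def parse_d2(stream):
    """Split an error-free concatenation of D_2 codewords into codewords."""
    words, i = [], 0
    while i < len(stream):
        if stream.startswith("110", i):       # the only codeword starting with 110
            words.append("110")
            i += 3
            continue
        j = stream.find("0110", i)            # first delimiter ends the codeword
        if j < 0:
            raise ValueError("stream does not end with a complete codeword")
        words.append(stream[i:j + 4])
        i = j + 4
    return words
```

The parser relies on the fact that a codeword other than 110 contains 0110 only as its suffix, so the first occurrence of 0110 found from the current position marks the end of that codeword.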

An example of a word belonging to the code $D_2$, but not to $C_{2,3}^{low}$, is 10000110. If we apply the
(2,3)-decoding procedure to this string, we obtain the number 17. However, as Table I shows, the codeword for 17
is 1010110.

Thus, the code $C_{2,3}^{low}$ is not complete. By contrast, the code $D_2$ is complete, as a representative of a wider
class of complete codes that will be defined and investigated in the following sections.

V. Splittable codes 

In the lower (2,3)-integer representation, we use sequences of (A, k)-pairs. Let us change the order of A and k
inside pairs. In this way, the dual sequence of (k, A)-pairs $(k_i, A_i)$, where $k_i$ is an arbitrary positive integer, and
$A_i$ takes the same values 0, 1 or 2, can also be associated with a number.

Apart from the encoding described above, this representation allows other binary prefix encodings, including the
following. We represent the value k as the word $0^{k-1}1$ in the unary numeration system with 1 as a separator and the
value A in the form $1^A0$. The concatenation of codewords corresponding to $k_i$ and $A_i$ respectively constitutes a
(k, A)-group. The codeword of a (k, A)-sequence is formed by the concatenation of corresponding (k, A)-groups
followed by the delimiter string 1111. It is obvious that in the concatenation of (k, A)-groups obtained through
the (2,3)-decomposition that word does not occur.

In the lower (2,3)-integer representation, not all possible (k, A)-sequences are valid. Let us abstract ourselves
from the semantics of values k and A as parameters of the lower (2,3)-factorization. Using the atomic
encoding of (k, A)-pairs defined above, we consider encoding all possible sequences of (k, A)-pairs
$(k_1, A_1)(k_2, A_2) \ldots (k_t, A_t)$, where the following restrictions hold: $0 \le A_i \le 2$, $0 < k_i$. It is easy to
see that the obtained set of codewords is nothing more than the code Fib4, named in [5] as the code $C_1$ of order 4.

In this way, by varying the upper bound for values A, $0 \le A_i < m$, and, respectively, the quantity of ones in a code
delimiter, we obtain different Fibonacci codes. So, if A can take only one value (which is encoded by "0") and
the delimiter consists of two ones, then we obtain the code Fib2. If A can take two values, which we encode by
words "0" and "10", then the delimiter consists of three consecutive ones, and we have the code Fib3. Overall, in
Fibonacci codes a restriction on the set of A-values naturally predetermines a delimiter. If A can take no more
than m different values, then the delimiter is the run of m + 1 ones.

Thus, we can assume that the lower (2,3)-code, the popular Fibonacci codes and possibly some others can be
viewed as different realizations of a more general method of number encoding based on encoding sequences of
ordered integer pairs with limitations on one of their components.

From a practical point of view, it is also important that a code contains a sufficient number of short words. This
means that if we consider a code with delimiters, the delimiters or their prefix parts should be included in some
short sequences of (A, k) or (k, A)-groups. The longer codewords can contain these shorter words as suffixes and
thus we may not consider delimiters apart from codes of (A, k) (or (k, A))-sequences. Summarizing all of the above,
we come to the following definition of (A, k)-codes.

Definition 1. Let S be a given set of sequences of (A, k)-pairs, where A is a non-negative integer that does not
exceed some constant d, and k can be any positive natural number. A (A, k)-code of S is the set of binary words
that satisfy the following conditions:

(i) values A and k are encoded by separate independent prefix encoding functions $\varphi_1$ and $\varphi_2$, respectively;

(ii) the encoding of a (A, k)-pair is defined as the concatenation $\varphi_1(A)\varphi_2(k)$, which we call a (A, k)-group;

(iii) the codeword of a (A, k)-sequence from S is the sequential concatenation of the corresponding (A, k)-groups.

A (A, k)-code is any set of binary words that can be interpreted as a (A, k)-code for some set S of (A, k)-sequences.
Thus, to define a (A, k)-code it is necessary to specify a set S of (A, k)-sequences and to choose well defined basic
encodings of (A, k)-pairs.

In what follows, we consider only codes where S is the set of all possible (A, k)-sequences. In general,
as in the case of (2,3)-codes, a basic set S could be a subset of all (A, k)-sequences.
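
Definition 1 can be illustrated with a small encoder and decoder. The choice of functions here is only an example: $\varphi_1$ is borrowed from the lower (2,3)-code (A = 2, 1, 0 encoded as 0, 11, 10), $\varphi_2$ is the plain unary code $1^{k-1}0$ without the exceptional cases of Section III, and S is taken to be the set of all (A, k)-sequences with A at most 2. The names `PHI1`, `phi2`, `encode_pairs` and `decode_pairs` are hypothetical.

```python
PHI1 = {0: "10", 1: "11", 2: "0"}             # example prefix code for A (d = 2)


def phi2(k):
    """Unary prefix code for k >= 1."""
    return "1" * (k - 1) + "0"


def encode_pairs(pairs):
    """Codeword of an (A, k)-sequence: concatenation of the (A, k)-groups."""
    return "".join(PHI1[a] + phi2(k) for a, k in pairs)


def decode_pairs(word):
    """Split a codeword back into (A, k)-pairs using the prefix property."""
    inverse = {code: a for a, code in PHI1.items()}
    pairs, i = [], 0
    while i < len(word):
        for length in (1, 2):                 # codewords of PHI1 have length 1 or 2
            if word[i:i + length] in inverse:
                a = inverse[word[i:i + length]]
                i += length
                break
        else:
            raise ValueError("no codeword for A at position %d" % i)
        j = word.find("0", i)                 # the unary code for k ends at the first 0
        if j < 0:
            raise ValueError("unterminated code for k")
        pairs.append((a, j - i + 1))
        i = j + 1
    return pairs
```

Because both component codes are prefix codes, the decoder never needs to look ahead beyond the current group, which is the structural regularity the definition is meant to capture.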

The definition of a (k, A)-code is similar to that given above, with (A, k)-pairs replaced by (k, A)-pairs.

We call both the (A, k) and (k, A)-codes splittable codes.

The important property of splittable codes is that any codeword, including a delimiter, consists of a whole number
of (A, k) (respectively, (k, A))-groups. This structural regularity can also be used as an element of proving technique
in establishing important code properties, such as completeness, universality, and density.

As shown above, the codewords of Fibonacci codes can be represented as sequences of (k, A)-groups, which are
externally supplemented by a delimiter. Interestingly, using specific encodings of k and A, these codewords can
be interpreted as sequences consisting of a whole number of (k, A)-groups even with a delimiter. Nevertheless,
they cannot be given as sequences of (A, k)-groups.

Theorem 1. Any Fibonacci code $\mathrm{Fib}_m$ is a (k, A)-code, but not a (A, k)-code.

Proof: Consider a (k, A)-pair, where k could be any positive integer, and A can have only m different values,
$0 \le A < m$. Let us encode k by the string $0^{k-1}1$, which comprises k - 1 zeros. Values of A we encode by m
strings: $0, 10, \ldots, 1^{m-2}0$, which contain runs of up to m - 2 ones, and the string $1^{m-1}$ corresponding to the
value m - 1.

Using this encoding we prove the first part of the theorem statement by induction on the codeword length.

Let $\alpha$ be a codeword from $\mathrm{Fib}_m$. The minimal possible length of $\alpha$ is equal to m. If that is so,
$\alpha = 1^m = 1\,1^{m-1}$. This string corresponds to the (k, A)-pair (1, m - 1).

Suppose that the statement of the theorem holds for all codewords having lengths less than or equal to some integer
t, $t \ge m$. Assume that the length of $\alpha$ is t + 1.

If $\alpha$ starts with 1, then $\alpha$ can be represented in the form $\alpha = 1^i 0 \beta = 1\,1^{i-1} 0 \beta$,
$0 < i < m$. The prefix $1\,1^{i-1}0$ corresponds to the (k, A)-pair (1, i - 1). The shorter string $\beta$ also belongs
to $\mathrm{Fib}_m$. Thus, by the inductive assumption $\beta$ comprises an integral number of (k, A)-groups.

Consider the case when $\alpha$ starts with 0, $\alpha = 0^i 1 \beta$, $i > 0$. If $\beta$ is the suffix of the form
$1^{m-1}$, then $\alpha = 0^i 1\,1^{m-1}$, and that corresponds to the (k, A)-pair (i + 1, m - 1).

In the other case, $\beta$ is a string of the form $\beta = 1^j 0 \gamma$, $0 \le j < m - 1$, $\gamma \in \mathrm{Fib}_m$.
This gives the representation $\alpha = 0^i 1\,1^j 0 \gamma$. The prefix part $0^i 1\,1^j 0$ is the codeword corresponding
to the (k, A)-pair (i + 1, j). By the inductive assumption the string $\gamma$ contains a whole number of (k, A)-groups.
Hence, $\alpha$ corresponds to some (k, A)-sequence. By induction the first part of Theorem 1 is proved.
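
The first part of the proof can be checked mechanically for small m and short codewords. The sketch below assumes the encoding used in the proof (k as $0^{k-1}1$; A as $1^A0$ for A < m - 1 and as $1^{m-1}$ for A = m - 1) and enumerates codewords by brute force. The function names are hypothetical.

```python
from itertools import product


def fib_words(m, length):
    """Brute-force list of the Fib_m codewords of a given length."""
    suffix = "1" * m
    words = ("".join(bits) for bits in product("01", repeat=length))
    return [w for w in words if w.endswith(suffix) and suffix not in w[:-1]]


def encode_group(k, a, m):
    """0^(k-1) 1 for k, then 1^a 0 for a < m - 1 or 1^(m-1) for a = m - 1."""
    tail = "1" * a + "0" if a < m - 1 else "1" * (m - 1)
    return "0" * (k - 1) + "1" + tail


def split_into_groups(word, m):
    """Greedy left-to-right parse of a Fib_m codeword into (k, A)-pairs."""
    pairs, i = [], 0
    while i < len(word):
        j = word.index("1", i)                # end of the 0^(k-1) 1 part
        k = j - i + 1
        j += 1
        a = 0
        while a < m - 1 and j < len(word) and word[j] == "1":
            a, j = a + 1, j + 1
        if a < m - 1:                         # codes for A < m - 1 end with a zero
            if j == len(word):
                raise ValueError("truncated group")
            j += 1
        pairs.append((k, a))
        i = j
    return pairs
```

Re-encoding the pairs returned by `split_into_groups` with `encode_group` should reproduce the original codeword, and only the last pair should carry the value A = m - 1, which plays the role of the delimiter.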

Consider the second part of the theorem. Suppose, to the contrary, that $\mathrm{Fib}_m$ is a (A, k)-code with some prefix
encoding functions $\varphi_1$ for A-values and $\varphi_2$ for k-values.

For any integer k the codeword $0^k 1^m$ belongs to $\mathrm{Fib}_m$. On the other hand, the lengths of codewords
corresponding to A-values are bounded. It follows that there exists a value A' such that $\varphi_1(A') = 0^s$ for some
integer $s > 0$.

Consider the word $0^s 1^m$. The prefix property of the encoding $\varphi_1$ implies that there are no other codes of A of
the form $0^r$, $r < s$. It follows that there exists some value k' such that $\varphi_2(k') = 1^t$, $t > 0$, and
$\varphi_1(A')\varphi_2(k')$ is the first (A, k)-group for the string $0^s 1^m$.

Consider the string $1^m$. It also belongs to $\mathrm{Fib}_m$. By our assumption, some (A, k)-groups constitute the
representation

$1^m = \varphi_1(A_1)\varphi_2(k_1) \ldots \varphi_1(A_n)\varphi_2(k_n)$.

The prefix property of encodings $\varphi_1$ and $\varphi_2$ implies that $A_1 = A_2 = \ldots = A_n$,
$k' = k_1 = k_2 = \ldots = k_n$, $\varphi_1(A_1) = 1^r$, $r > 0$, $\varphi_2(k_1) = 1^t$, $t > 0$.

It immediately follows that the inequality $t < m$ holds.

Thus, from the consideration of the string $0^s 1^m$ we conclude that the non-empty string $1^{m-t}$ consists of a whole
number of identical (A, k)-groups. Each of them corresponds to the pair $(A_1, k_1)$.

The string $1^m$ can be represented in the form $1^m = 1^t 1^{m-t}$. It follows that the string $1^t$ should be represented
using an integral quantity of identical (A, k)-groups corresponding to the encoding
$\varphi_1(A_1)\varphi_2(k_1) = 1^{r+t}$, $r > 0$. This contradiction concludes the proof. ■

For Fibonacci codes considered as (k, A)-codes we use the unary encoding of parameters k and A. Note that
splittable codes used for data compression are more effective when the average codeword
length is shorter. From this perspective, the encoding of parameters k and A in the unary numeration system is
not economical. More economical, for example, is the truncated binary encoding of the values A and k. However,
for the parameter k such encoding is impossible since the set of its values is unlimited. Nevertheless, the truncated
binary encoding can be applied to encode the values of the parameter A.

Concerning the parameter k, there are only two unary prefix encodings, $1^{k-1}0$ or $0^{k-1}1$. Theoretically, other
prefix encodings, such as Elias codes [10], can be used for encoding k. However, in applications of splittable codes
to text compression, the probability distribution of k-values is geometric, and unary codes are the most effective
for this kind of distribution.

The Golomb codes [13] completely correspond to the principles described above. They are among the simplest
$(k, \Delta)$-codes, where each codeword consists of one $(k, \Delta)$-group.

If we consider more complex codes, whose codewords can contain several $(\Delta, k)$- or $(k, \Delta)$-groups, then certain
groups should be considered as terminating a codeword, i.e. as separating ones. We note that due to the unary
encoding of the parameter $k$, the last bit of any $(\Delta, k)$-group always has the same value, say zero. Therefore, to
endow a splittable code with the feature of instantaneous separation, it is suitable to construct a code from $(\Delta, k)$-,
but not $(k, \Delta)$-groups, predetermining a delimiter as $0a0$, where $a0$ is a separating group, and the zero in front of it is
the last symbol of the previous group. If we encode $\Delta$ in the binary form, then $(k, \Delta)$-groups will not have such
properties, because they can begin and end with zero as well as with one. This complicates finding the place that
matches a delimiter.

However, the more important advantage of $(\Delta, k)$-codes over $(k, \Delta)$-codes is the possibility to form short
codewords that do not contain a whole delimiter. For example, they can consist of a separating group of the
form $a0$, while the delimiter takes the form $0a0$. Longer delimiters provide a better asymptotic density of a code,
while short codewords enable efficient compression for relatively small alphabet sizes. For
example, the code $D_2$ considered above, which will be proved further to be a $(\Delta, k)$-code, contains the word 110,
although the sequence 0110 is the code delimiter. As will be shown, it has a higher asymptotic density than the
code Fib3, and is only slightly inferior in the efficiency of compressing texts with small alphabets.

VI. Multi-delimiter codes 

One of the families of efficient $(\Delta, k)$-codes can be obtained by using several delimiters of the form $01^m0$ in
one code. The remaining part of this presentation deals entirely with the investigation of these codes.

Let $M = \{m_1, \ldots, m_t\}$ be a set of integers, given in ascending order, $0 < m_1 < \ldots < m_t$.

Definition 2. The multi-delimiter code $D_{m_1,\ldots,m_t}$ consists of all the words of the form $1^{m_i}0$, $i = 1, \ldots, t$, and all
other words that meet the following requirements:

(i) for any $m_i \in M$ a word does not start with a sequence $1^{m_i}0$;

(ii) a word ends with the suffix $01^{m_i}0$ for some $m_i \in M$;

(iii) for any $m_i \in M$ a word cannot contain the sequence $01^{m_i}0$ anywhere, except as a suffix.

The given definition implies that the code delimiters in $D_{m_1,\ldots,m_t}$ are sequences of the form $01^{m_i}0$. However, the
code also contains shorter words of the form $1^{m_i}0$, which form the delimiter together with the ending zero of a
preceding codeword.

Evidently, any multi-delimiter code is prefix-free and thus UD.
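Definition 2 can be checked mechanically. The following Python sketch is only an illustration of the three conditions by brute-force enumeration; the function names `is_codeword` and `codewords` are ours and do not appear in the paper.

```python
from itertools import product

def is_codeword(word, delimiters):
    """Check Definition 2 for a bit string `word` and a set M of delimiter sizes."""
    short_words = {"1" * m + "0" for m in delimiters}   # words of the form 1^m 0
    if word in short_words:
        return True
    # (i) a word must not start with 1^m 0
    if any(word.startswith(s) for s in short_words):
        return False
    full = ["0" + s for s in short_words]               # delimiters 0 1^m 0
    # (ii) a word must end with some delimiter
    if not any(word.endswith(d) for d in full):
        return False
    # (iii) a delimiter may occur only as a suffix
    for d in full:
        pos = word.find(d)
        while pos != -1:
            if pos + len(d) != len(word):
                return False
            pos = word.find(d, pos + 1)
    return True

def codewords(delimiters, max_len):
    """All codewords of length <= max_len: shortest first, lexicographic within a length."""
    return [
        "".join(bits)
        for n in range(1, max_len + 1)
        for bits in product("01", repeat=n)
        if is_codeword("".join(bits), delimiters)
    ]

print(codewords({2}, 5))  # ['110', '0110', '00110', '10110']
```

For example, `codewords({2}, 5)` lists the four shortest words of $D_2$. Enumeration is exponential in the length and is meant only for small experiments.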

Table II shows examples of multi-delimiter codewords. This table lists all codewords of lengths not longer than
7 of different multi-delimiter codes and, for comparison, of the Fibonacci codes Fib2 and Fib3.

TABLE II

Sample codeword sets of some multi-delimiter and Fibonacci codes

Index  Fib2     D1       D1,2     Fib3     D2       D2,3     D2,3,4
1      11       10       10       111      110      110      110
2      011      010      010      0111     0110     0110     0110
3      0011     0010     110      00111    00110    1110     1110
4      1011     00010    0010     10111    10110    00110    00110
5      00011    11010    0110     000111   000110   10110    10110
6      01011    000010   00010    010111   010110   01110    01110
7      10011    011010   00110    100111   100110   000110   11110
8      000011   110010   000010   110111   0000110  010110   000110
9      001011   111010   000110   0000111  0010110  100110   010110
10     010011   0000010  111010   0010111  0100110  001110   100110
11     100011   0011010  0000010  0100111  1000110  101110   001110
12     101011   0110010  0000110  1000111  1010110  0000110  101110
13     0000011  1100010  0111010  1010111  1110110  0010110  011110
14     0001011  0111010  1110010  0110111           0100110  0000110
15     0010011  1110010  1110110  1100111           1000110  0010110
16     0100011  1111010  1111010                    1010110  0100110
17     1000011                                      0001110  1000110
18     0101011                                      0101110  1010110
19     1001011                                      1001110  0001110
20     1010011                                               0101110
21                                                           1001110
22                                                           0011110
23                                                           1011110


The codes $D_{2,3}$ and $D_{2,3,4}$, with 2 and 3 delimiters respectively, contain many more short codewords than both
the Fibonacci code Fib3 and the one-delimiter code $D_2$. However, as will be demonstrated in the following, the
asymptotic density of these codes is lower.

Overall, codes with more delimiters have worse asymptotic density, but contain a larger number of short
codewords. This regularity is also related to the lengths of delimiters: the shorter they are, the more
short words a code contains.

For natural language text compression, the most effective codes seem to be those whose shortest delimiter has
two ones; we will examine them thoroughly.

Now we demonstrate that multi-delimiter codes belong to the class of splittable codes.

Theorem 2. Any multi-delimiter code $D_{m_1,\ldots,m_t}$ is a $(\Delta, k)$-code.

Proof: We need to set some positive integer that cannot be exceeded by the value of $\Delta$ and construct prefix
encodings for $\Delta$ and $k$ so that any codeword of $D_{m_1,\ldots,m_t}$ comprises a whole number of $(\Delta, k)$-groups.

Let $d$ be some fixed non-negative integer satisfying the inequalities $0 \le d < m_1$. The parameter $\Delta$ takes one of
$2^d + 1$ values. We encode these values by the symbol 0 and by all binary words of length $d + 1$ with the fixed first symbol
1. The value of the parameter $k$, which can be any positive integer, is encoded by the word $1^{k-1}0$. Evidently, these
encodings of the values $\Delta$ and $k$ are prefix-free.

Consider a word $1^r0$, where $r \ge m_1$. This word can be represented in the form $1^r0 = 1^{d+1}1^{r-d-1}0$. The
inequality $r \ge m_1$ and the choice of $d$ imply that $r \ge d + 1$. It follows that $1^r0$ corresponds to the $(\Delta, k)$-pair
with $\Delta$ encoded by $1^{d+1}$ and $k = r - d > 0$, and any word $\alpha \in D_{m_1,\ldots,m_t}$ of the form $1^r0$ represents some
$(\Delta, k)$-group.

Note that for any binary word $\alpha$ of length exceeding $d$ and containing zeros in its representation it is possible
to choose a prefix such that it can be interpreted both as a codeword of some value $\Delta$ and as a codeword of some
value $k$. Indeed, if $\alpha$ starts with 0 then this symbol can be interpreted as corresponding to $\Delta = 0$ or $k = 1$. If
$\alpha$ starts with 1 then $\alpha = 1^r0\beta$, where $r > 0$ and $\beta$ is a binary word. The prefix $1^r0$ can be interpreted as the
codeword of the value $k = r + 1$. But it is also possible to choose the prefix of $\alpha$ having length $d + 1$, which
corresponds to some value of $\Delta$.

Now, suppose that $\alpha \in D_{m_1,\ldots,m_t}$ and it does not have the form $1^r0$. Let us consider parsing the codeword $\alpha$
from left to right, sequentially extracting the corresponding $(\Delta, k)$-groups as long as this is possible. As a result, we either
partition $\alpha$ into a whole number of $(\Delta, k)$-groups or obtain a remainder that is not capable of containing
a whole number of $(\Delta, k)$-groups.

In the first case we obtain the desired partitioning of $\alpha$ into an integral number of $(\Delta, k)$-groups.

Consider the case of obtaining a remainder. Let us examine how the ending of a codeword
is processed under this procedure. The suffix of a codeword has the form $01^{m_i}0$ and contains at least $m_1$ ones. The first bit "0" of that
suffix either can be the ending of some codeword of $k$ or can belong to a codeword of $\Delta$. In the first case, at the
last iteration we obtain the residue $1^{m_i}0$ with no less than $m_1$ ones that, as shown above, is a $(\Delta, k)$-group. In the
second case, we note that the codeword of $\Delta$ comprises no more than $m_1$ bits, and after its extraction we obtain
the remaining sequence of the form $1 \ldots 10$, which represents a particular value of $k$. Thus, the situation when at
the last iteration we obtain a remainder which is not capable of containing a whole $(\Delta, k)$-group is impossible. ■

Note that Theorem 2 holds for any value $d$ that satisfies the inequalities $0 \le d < m_1$. In the sequel, to
simplify considerations, we presume that $d = 0$, i.e. the code of a $\Delta$-value comprises one bit.

Note that although in the code $D_2$ we used the encoding of three possible values of $\Delta$, which corresponds to
the value $d = 1$, all words of that code can also be represented as sequences of $(\Delta, k)$-groups with a single-bit encoding of $\Delta$.

Theorem 3. Any code $D_{m_1,\ldots,m_t}$ is complete.

Proof: A necessary and sufficient condition for a code $C$ to be complete is given by the Kraft-McMillan
equality $\sum_{c \in C} 2^{-|c|} = 1$. By $f_n$ denote the number of codewords of length $n$. This equality can be rewritten as:

\[ \sum_{n=1}^{\infty} 2^{-n} f_n = 1 \tag{1} \]

Consider the multi-delimiter code $D_{m_1,\ldots,m_t}$.

Theorem 2 allows us to choose the one-bit encoding for $\Delta$, and $k$ is encoded by $1^{k-1}0$.

For any $n \ge 2$ there exist two $(\Delta, k)$-groups of length $n$: $1^{n-1}0$ and $01^{n-2}0$. Among all of them, the $(\Delta, k)$-groups
that include $m_i$ ones, $i = 1, \ldots, t$, are terminal, i.e. they can occur only at the end of a codeword. Thus, for the
code $D_{m_1,\ldots,m_t}$ there are $2t$ terminal groups having lengths $m_1 + 1, m_1 + 2, \ldots, m_t + 1, m_t + 2$.

By $\tau_n$ denote the number of terminal groups of length $n$. Evidently, $\tau_n$ equals the number of occurrences
of $n$ in the set $\{m_1 + 1, m_1 + 2, \ldots, m_t + 1, m_t + 2\}$. This number can be equal to 0, 1 or 2. The number of
non-terminal groups of length $n$ equals $2 - \tau_n$.

Consider the codewords of length $n$ that contain at least two $(\Delta, k)$-groups. Each such word can be obtained
by prepending its first non-terminal $(\Delta, k)$-group to a shorter codeword. On the other hand, prepending an arbitrary
non-terminal group to any codeword forms a longer codeword. If a codeword contains only one $(\Delta, k)$-group,
then this group is terminal. Thus, taking into account that the length of the shortest $(\Delta, k)$-group is 2, we obtain
the following recurrent formula for calculating the number of codewords of length $n$:

\[ f_n = \tau_n + \sum_{k=0}^{n-2} (2 - \tau_{n-k}) f_k = \tau_n + 2(f_{n-2} + f_{n-3} + \ldots) - f_{n-(m_1+1)} - \ldots - f_{n-(m_t+1)} - f_{n-(m_1+2)} - \ldots - f_{n-(m_t+2)} \tag{2} \]

Let us apply this formula to calculate $f_{n-1}$:

\[ f_{n-1} = \tau_{n-1} + \sum_{k=0}^{n-3} (2 - \tau_{n-1-k}) f_k = \tau_{n-1} + 2(f_{n-3} + f_{n-4} + \ldots) - f_{n-(m_1+2)} - \ldots - f_{n-(m_t+2)} - f_{n-(m_1+3)} - \ldots - f_{n-(m_t+3)} \tag{3} \]

Find the right part of (3) in (2) and change it to $f_{n-1}$:

\[ f_n = \tau_n - \tau_{n-1} + 2f_{n-2} + f_{n-1} - f_{n-m_1-1} - \ldots - f_{n-m_t-1} + f_{n-m_1-3} + \ldots + f_{n-m_t-3} \tag{4} \]

Denoting the left part of (1) by $s$ and taking into account that $f_0 = f_{-1} = \ldots = 0$, for any $p \ge 0$ we have the
following equalities: $\sum_{n=1}^{\infty} 2^{-n} f_{n-p} = 2^{-p} \sum_{n=1}^{\infty} 2^{-(n-p)} f_{n-p} = s 2^{-p}$.

DRAFT 


August 7, 2015 





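Taking them into consideration and substituting expression (4) in (1), we obtain the following:

\[ s = \sum_{n=1}^{\infty} 2^{-n} f_n = \sum_{n=1}^{\infty} 2^{-n} \left( \tau_n - \tau_{n-1} + f_{n-1} + 2f_{n-2} - f_{n-(m_1+1)} - \ldots - f_{n-(m_t+1)} + f_{n-(m_1+3)} + \ldots + f_{n-(m_t+3)} \right) \]
\[ = \sum_{n=1}^{\infty} 2^{-n} \tau_n - \frac{1}{2} \sum_{n=1}^{\infty} 2^{-n} \tau_n + \frac{1}{2} s + \frac{1}{2} s - s \left( 2^{-m_1-1} + \ldots + 2^{-m_t-1} \right) + s \left( 2^{-m_1-3} + \ldots + 2^{-m_t-3} \right) \tag{5} \]

Taking into account that $2^{-m_i-3} - 2^{-m_i-1} = -3 \cdot 2^{-m_i-3}$ for any $i$, that $\sum_{n=1}^{\infty} 2^{-n} \tau_n - \frac{1}{2} \sum_{n=1}^{\infty} 2^{-n} \tau_n = \frac{1}{2} \sum_{n=1}^{\infty} 2^{-n} \tau_n$, and
cancelling out $s$ in both parts of (5), we obtain the following formula:

\[ 3s \sum_{i=1}^{t} 2^{-m_i-3} = \frac{1}{2} \sum_{n=1}^{\infty} 2^{-n} \tau_n \tag{6} \]

Since the lengths of terminal $(\Delta, k)$-groups are $m_1 + 1, m_1 + 2, \ldots, m_t + 1, m_t + 2$, the equality

\[ \sum_{n=1}^{\infty} 2^{-n} \tau_n = \sum_{i=1}^{t} \left( 2^{-m_i-1} + 2^{-m_i-2} \right) = \frac{3}{4} \sum_{i=1}^{t} 2^{-m_i} \]

is satisfied.

Therefore, equality (6) takes the form

\[ \frac{3}{8} s \sum_{i=1}^{t} 2^{-m_i} = \frac{3}{8} \sum_{i=1}^{t} 2^{-m_i}. \]

That implies the condition $s = 1$. ■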
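The counting recurrence used in this proof is easy to evaluate numerically. The sketch below is a numerical illustration only, not part of the proof: it computes the number of codewords of each length from the number of terminal groups of each length and prints partial Kraft sums. The names `codeword_counts` and `kraft_sum` and the cut-off length 400 are choices made for this example.

```python
def codeword_counts(delimiters, max_len):
    """f[n] = number of codewords of length n, computed by the group recurrence.

    tau[n] is the number of terminal groups of length n; a codeword is either
    one terminal group or a non-terminal group prepended to a shorter codeword.
    """
    tau = [0] * (max_len + 3)
    for m in delimiters:
        for length in (m + 1, m + 2):      # terminal groups 1^m 0 and 0 1^m 0
            if length < len(tau):
                tau[length] += 1
    f = [0] * (max_len + 1)
    for n in range(1, max_len + 1):
        # the first group has length n - k >= 2, the rest is a codeword of length k
        f[n] = tau[n] + sum((2 - tau[n - k]) * f[k] for k in range(1, n - 1))
    return f

def kraft_sum(delimiters, max_len=400):
    """Partial sum of 2^(-n) * f_n over lengths n = 1..max_len."""
    f = codeword_counts(delimiters, max_len)
    return sum(f[n] / 2**n for n in range(1, max_len + 1))

for M in ({2}, {2, 3}, {1, 2}):
    print(sorted(M), round(kraft_sum(M), 6))
```

The partial sums approach 1 from below as the cut-off grows, in agreement with the theorem; the counts for small lengths coincide with the numbers of codewords listed in Table II.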

The $(\Delta, k)$-structure of multi-delimiter codes also enables us to prove another important feature, universality, but
we give a simpler proof based on encoding integers.


VII. Encoding integers 

We defined a multi-delimiter code as a set of words. There exists a simple bijection between the set of natural
numbers and the set of codewords of any multi-delimiter code. Thus, it enables us to encode integers by codewords
of these codes.

Let $M = \{m_1, \ldots, m_t\}$ be the set of parameters of the code $D_{m_1,\ldots,m_t}$. By $N_M = \{j_1, j_2, \ldots\}$ denote the
ascending sequence of all natural numbers that do not belong to $M$.

Example. Let $M = \{2, 5\}$. This gives the set $N_M = \{1, 3, 4, 6, 7, 8, \ldots\}$.

By $\varphi_M(i)$ denote the function $\varphi_M(i) = j_i$, $j_i \in N_M$, as defined above.

It is easy to see that the function $\varphi_M$ is a bijective mapping of the set of natural numbers onto $N_M$. Evidently, this
function and the inverse function $\varphi_M^{-1}$ can be constructively implemented by simple one-cycle iterative procedures.

The main idea of encoding integers by the code $D_{m_1,\ldots,m_t}$ is as follows. We scan the binary representation of
an integer from left to right. During this scan each internal isolated group of $i$ consecutive 1s is changed to $\varphi_M(i)$
1s. This way we exclude the appearance of delimiters inside a codeword. In decoding we change internal isolated
groups of $j$ consecutive 1s to the similar groups of $\varphi_M^{-1}(j)$ ones. A detailed description of the encoding procedure
is as follows.

Bitwise Integer Encoding Algorithm.

Input: $x = x_n x_{n-1} \ldots x_0$, $x_i \in \{0, 1\}$, $x_n = 1$;

Result: a codeword from $D_{m_1,\ldots,m_t}$.

1) $x \leftarrow x - 2^n$, i.e. extract the most significant bit of the number $x$, which is always 1.

2) If $x = 0$, append the sequence $1^{m_1}0$ to the string $x_{n-1} \ldots x_0$, which contains only zeros or is empty. Result
$\leftarrow x_{n-1} \ldots x_0 1^{m_1}0$. Stop.

3) If the binary representation of $x$ takes the form of a string $0^r 1^{m_i} 0$, $r \ge 0$, $m_i \in M$, $i > 1$, then Result $\leftarrow x$.
Stop.

4) In the string $x$ replace each isolated group of $i$ consecutive 1s with the group of $\varphi_M(i)$ consecutive 1s, except
its occurrence as a suffix of the form $01^{m_i}0$, $i > 1$. Assign this new value to $x$.

5) If the word ends with a sequence $01^{m_i}0$, $i > 1$, then Result $\leftarrow x$. Stop.

6) Append the string $01^{m_1}0$ to the right end of the word. Assign this new value to $x$. Result $\leftarrow x$. Stop.

According to this algorithm, if $x \ne 2^n$, the delimiter $01^{m_1}0$ with $m_1$ ones is attributed to a codeword externally,
and therefore it should be deleted during the process of decoding, while the delimiters of the form $01^{m_i}0$, $i > 1$, are
informative parts of codewords and they must be processed during the decoding. If $x = 2^n$, the last $m_1 + 1$ bits of
the form $1^{m_1}0$ must be deleted.

Bitwise Decoding Algorithm.

Input: a codeword $y \in D_{m_1,\ldots,m_t}$.

Result: an integer given in the binary form.

1) If the codeword $y$ is of the form $0^p 1^{m_1} 0$, where $p \ge 0$, remove the last $m_1 + 1$ bits and go to step 4.

2) If the codeword $y$ ends with the sequence $01^{m_1}0$, remove the last $m_1 + 2$ bits. Assign this new value to $y$.

3) In the string $y$ replace each isolated group of $i$ consecutive 1s, where $i \notin M$, with the group of $\varphi_M^{-1}(i)$
consecutive 1s. Assign this new value to $y$.

4) Prepend the symbol 1 to the beginning of $y$. Result $\leftarrow y$. Stop.
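The two procedures above can be sketched in Python as follows. This is an illustrative reading of the steps, not the authors' implementation: the function names are ours, the delimiter sizes are passed as a sorted tuple `M`, and bit strings are represented as Python strings.

```python
import re

def phi(M, i):
    """i-th natural number (1-based) that does not belong to M."""
    j = 0
    while i > 0:
        j += 1
        if j not in M:
            i -= 1
    return j

def phi_inv(M, j):
    """Inverse of phi for j not in M: position of j among the naturals outside M."""
    return j - sum(1 for m in M if m < j)

def encode(x, M):
    """Codeword of the positive integer x; M is a sorted tuple of delimiter sizes."""
    m1, bits = M[0], bin(x)[3:]                  # step 1: drop the leading 1
    if "1" not in bits:                          # step 2: x is a power of two
        return bits + "1" * m1 + "0"
    tail = ""
    for m in M[1:]:                              # steps 3-5: keep a final 1^m 0, m != m1,
        s = "1" * m + "0"                        # standing at the start or after a zero
        if bits == s or bits.endswith("0" + s):
            bits, tail = bits[: len(bits) - len(s)], s
            break
    body = re.sub("1+", lambda g: "1" * phi(M, len(g.group())), bits)   # step 4
    return body + tail if tail else body + "0" + "1" * m1 + "0"         # steps 5, 6

def decode(y, M):
    """Integer encoded by the codeword y."""
    m1 = M[0]
    if re.fullmatch("0*" + "1" * m1 + "0", y):   # step 1: codeword of a power of two
        return int("1" + y[: len(y) - m1 - 1], 2)
    if y.endswith("0" + "1" * m1 + "0"):         # step 2: drop the external delimiter
        y = y[: len(y) - m1 - 2]

    def shrink(g):                               # step 3: groups with size in M stay as is
        n = len(g.group())
        return "1" * (n if n in M else phi_inv(M, n))

    return int("1" + re.sub("1+", shrink, y), 2)  # step 4: restore the leading 1

print([encode(x, (2,)) for x in range(1, 6)])
```

For the code $D_2$ the call `encode(3, (2,))` gives 10110: the leading bit of 11 is dropped, the remaining single 1 is kept as a group of $\varphi_M(1) = 1$ ones, and the delimiter 0110 is appended.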

The following lemma gives an upper bound for the length of a multi-delimiter codeword.

Lemma 1. Let $D_{m_1,\ldots,m_t}$ be a multi-delimiter code, and let $c_i$ be the codeword of an integer $i$ obtained by the encoding
algorithm given above. The length of $c_i$ satisfies the following upper bound: $|c_i| \le \left( \frac{t}{2} + 1 \right) \log_2 i + m_1 + 2$.

Proof: The encoding procedure that transforms a number $i$ given in binary form into the corresponding codeword
of the code $D_{m_1,\ldots,m_t}$ can enlarge each internal isolated group of consecutive 1s by at most $t$ ones. The quantity
of such groups does not exceed $\frac{1}{2} \log_2 i$. To some binary words the delimiter $01^{m_1}0$ could be externally appended.
Therefore, the length of the codeword for $i$ is upper bounded by the value $\log_2 i + \frac{t}{2} \log_2 i + m_1 + 2$. ■

Now we are ready to prove that any multi-delimiter code is universal.

The concept of universality was introduced by P. Elias [10]. This notion reflects the property of prefix sets to be
nearly optimal codes for data sources with any given probability distribution function.

A set of codewords of lengths $l_i$ ($l_1 \le l_2 \le \ldots$) is called universal if there exists a constant $K$ such that
for any finite distribution of probabilities $P = (p_1, \ldots, p_n)$, where $p_1 \ge p_2 \ge \ldots \ge p_n$, the following inequality holds

\[ \sum_{i=1}^{n} l_i p_i \le K \cdot \max(1, E(P)), \tag{7} \]

where $E(P) = -\sum_{i=1}^{n} p_i \log_2 p_i$ is the entropy of the distribution $P$, and $K$ is a constant independent of $P$.

Theorem 4. Any multi-delimiter code $D_{m_1,\ldots,m_t}$ is universal.

Proof: As in Lemma 1, by $c_i$ denote the codeword in $D_{m_1,\ldots,m_t}$ corresponding to the integer $i$. Let us sort the
codewords from $D_{m_1,\ldots,m_t}$ in ascending order of their bit lengths, $a_1, a_2, \ldots$. Map them to the symbols of the input
alphabet sorted in descending order of their probabilities.

We claim that the length of any word $a_i$ also satisfies the length upper bound for $|c_i|$ given by Lemma 1.
Indeed, consider the set $\{c_1, c_2, \ldots, c_i\}$. Obviously, each of its elements satisfies that upper bound. In the sequence
$a_1, a_2, \ldots$ at least one element, say $c_j$, $1 \le j \le i$, occupies a place $k$ such that $k \ge i$, $a_k = c_j$. This implies
$|a_i| \le |a_k| = |c_j|$. It follows that $|a_i|$ satisfies the upper bound for $|c_i|$, which is equal to $\left( \frac{t}{2} + 1 \right) \log_2 i + m_1 + 2$ as
Lemma 1 states.

The sequence $a_1, a_2, \ldots$ can be considered as a new encoding of natural numbers. To conclude the proof it remains
only to apply the general Lemma 6 by Apostolico and Fraenkel taken from [5]: "Let $\psi$ be a binary representation
such that $|\psi(k)| \le c_1 + c_2 \log k$ ($k \in Z^+$), where $c_1$ and $c_2$ are constants and $c_2 > 0$. Let $p_k$ be the probability to
meet $k$. If $p_1 \ge p_2 \ge \ldots \ge p_n$, $\sum p_i \le 1$, then $\psi$ is universal". ■

VIII. Byte aligned algorithms 

The encoding and decoding algorithms considered above are bitwise, and therefore they are quite slow. We
can construct accelerated algorithms that process bytes. Since decoding is performed in real time more often than
encoding and in general lasts longer, we focus on the more important task of accelerating decoding.

The general idea of the byte aligned decoding algorithm is similar to the one described in [7] for the Fibonacci
codes. At the $i$-th iteration of this algorithm, a whole number of bytes of encoded text is read out. We denote
this portion of text by $u_i$. Assume that $u_i$ has the form $s_i E(w_1^i) \ldots E(w_k^i) r_i$, where $E$ is an encoding function;
$E(w_1^i), \ldots, E(w_k^i)$ are the codewords of numbers $w_1^i, \ldots, w_k^i$; $s_i$ is the beginning of the text $u_i$ that does not contain
a whole codeword; and $r_i$ is the remainder of the text $u_i$ that does not contain a whole codeword.

TABLE III

Decoding table for bytewise method for the code $D_2$

r_{i-1}  u_i       w_1      |w_1|  f_1  w_2   |w_2|  f_2  w_3  |w_3|  f_3  r_i
         11000111           0      1    0011  4      0                    1
1        01101011           0      1    1     1      0                    011
011      11001011  0111001  7      0                                      011
011      11101101  01111    5      1          0      0                    1
1        10011000           0      1    0     1      1    00   2      0




As is easy to see, the values $w_1^i, \ldots, w_k^i$ as well as the remainder $r_i$ can be unambiguously determined by $u_i$ and
the remainder $r_{i-1}$ of the previous portion of bytes. Thus, we consider $u_i$ and $r_{i-1}$ as indices of predefined arrays
$W_1, W_2, \ldots, W_k, R$ containing the corresponding decoded numbers and a remainder,

\[ W_1[r_{i-1}, u_i] = w_1^i, \ \ldots, \ W_k[r_{i-1}, u_i] = w_k^i, \quad R[r_{i-1}, u_i] = r_i. \]

We get decoded numbers directly from these arrays.

Note that the concatenation $r_{i-1} s_i$ is also a codeword, if it is not empty. Some bits from the beginning of the
number $E^{-1}(r_{i-1} s_i)$ may be unambiguously obtained at the $(i-1)$-th iteration while others are obtained at the
$i$-th iteration. Thus, we can make a correction assuming that $w_1^i$ and $w_k^i$ could be not fully decoded numbers,
but the ending or the beginning of the binary representation of a decoded number, respectively. The values $w_1^i, \ldots, w_k^i$
corrected in this way we denote by $w_1, \ldots, w_k$, omitting the index $i$ for simplicity. Accordingly, by $r_i$ we denote
the ending of the text $u_i$ which cannot be decoded unambiguously at the $i$-th iteration. Also, note that there is no
need to store the first bit of the numbers $w_1, \ldots, w_k$, because it is always equal to one.

To illustrate how the method works, we apply this general byte aligned algorithm to the code $D_2$, assuming that
at each iteration one byte is processed. The arrays $W_1, \ldots, W_k$ are stored in a predefined table. Some rows of this
table are shown in Table III. The shortest codeword of $D_2$ has the form 110. This implies that, with one exception,
one byte can encompass no more than three full or partial codewords from $D_2$. The only option when a byte can
cover four codewords fully or partially is the case 0110110x, where x is the last bit of the byte and the first bit of
the fourth codeword. This bit can be attributed to the unprocessed remainder $r$, and thus it is enough to store three
resultant numbers.

Together with the numbers $w_1, w_2, w_3$ and the remainder $r$ we store the following values in each row of the
table: $|w_i|$ is the length of the $i$-th number in bits (excluding the first bit); $f_i$ is the flag signaling if the codeword
$w_i$ is the last in the current byte ($f_i = 0$) or not ($f_i = 1$).

The rows shown in Table III, from top to bottom, are those used to decode the coded
text 11000111 01101011 11001011 11101101 10011000.

The structure of the second byte is shown in Fig. 1.

Fig. 1. Parsing of the byte 01101011 (the $i$-th byte, preceded by the remainder $r_{i-1} = 1$) into $E(w_1)$, $E(w_2)$ and $r_i$

Let us examine the set of possible values of the remainder $r$. First, let us make the following comments:

1) If some $(\Delta, k)$-group is a part of the byte composition, then it can be unambiguously decoded regardless of
the next byte content, and, therefore, its bits will not be included in $r$.

2) If the byte ends with $p \ge 3$ consecutive ones, then they will be decoded as $p - 1$ ones regardless of the next
byte content. In this case, the string $r$ consists of the last 1, which during the decoding of the next byte will
serve as an indication that the previous byte did not end with zero.

3) The string 10 can be located only at the end or at the beginning of some $(\Delta, k)$-group. In both cases, it can
be decoded regardless of the next byte content: in the first case it is decoded together with the $(\Delta, k)$-group
in which it is included. In the second case, it is decoded as 10.

It follows from the first of these observations that the sequence $r$ cannot contain two consecutive zeros, because
such a situation is possible only if two zeros constitute a full $(\Delta, k)$-group (then $r$ does not contain its bits), or
when the first "0" is the end of one $(\Delta, k)$-group and the second "0" is the beginning of the next group (in this
case $r$ contains only the second zero). It follows from the second and third observations that the sequence $r$
cannot contain three consecutive ones or the string 10. Thus, we obtain a total of 6 possible values of $r$: the empty string,
0, 1, 01, 11, 011.

Now we show that any row in Table III can be "packed" into a single 32-bit machine word. We enumerate all
possible values of $r$ by binary numbers from 0 to 5, and thus three bits are enough to store any such value. Note
that if a certain flag $f_i$ is zero (this means that the word $w_i$ is not fully decoded), then there is no need to consider
the words $w_{i+1}, w_{i+2}, \ldots$, as well as the flags $f_{i+1}, f_{i+2}, \ldots$, as the code $w_i$ extends to the beginning of the string $r$ or
to the right boundary of the byte. Denoting these values $f_i$, which can be disregarded, by zeroes, we obtain the
following possible combinations of flag values $f_1, f_2, f_3$: 000, 100 and 11x, where x is an arbitrary binary value.
For each of these cases we describe a special method of packing a row of Table III into a four-byte word (Fig.
2). In any case, we write the values $f_1, f_2, f_3$ into the three most significant bits, and the values $w_1$, $|w_1|$; $w_2$, $|w_2|$
(if available); $w_3$, $|w_3|$ (if available) and $r$, from the least significant to the most significant bits, in the specified
order.

$(f_1, f_2, f_3) = 000$. In this case, the value $w_1$ takes no more than 10 bits. Indeed, consider first the case when
$r_{i-1} = 011$. If $f_1 = 0$, then the most significant bit of the byte $u_i$ cannot be zero, since otherwise there would be a
sequence 0110, which means the end of the codeword and $f_1 = 1$. Assume that all the bits of $u_i$ are ones. Then the
last bit refers to $r_i$, and the length of the decoded value $w_1$ is $3 + 7 = 10$ bits. If $u_i$ contains a zero bit, then during
decoding of $w_1$ a sequence of the form 01...10 with more than 2 ones will be processed, which will correspond
to a piece of the code $w_1$ that is one bit shorter. Therefore, the total bit length of $w_1$ will not exceed $3 + 8 - 1 = 10$ bits. If
the value $r_{i-1}$ contains fewer than three bits, then the length of $w_1$ obviously cannot be longer than $8 + 2 = 10$ bits.

Fig. 2. Packing a string of decoding table into four-byte computer word

Thus, in the case of $(f_1, f_2, f_3) = 000$, four bits are enough to store the value $|w_1|$, and, in general, the packing
of a row of Table III in a four-byte word appears as in Fig. 2(a).

$(f_1, f_2, f_3) = 100$. In this case, the string concatenation $r_{i-1} u_i$ must contain the delimiter 0110 or start inside
the delimiter. The value $w_1$ will be the longest if the delimiter is shifted to the right boundary of the byte. As the
delimiter is not taken into consideration during decoding, the value $w_1$ will be obtained as a result of decoding at
most 7 bits, and for the reasons set out in the case $(f_1, f_2, f_3) = 000$, the greatest possible length of $w_1$ will be one
bit less, i.e. $|w_1| \le 6$, and 3 bits are enough to store the value $|w_1|$.

In the case $(f_1, f_2, f_3) = 100$ we also must store the value $w_2$. Since the code $w_1$ takes at least one bit of the
byte $u_i$, for the code $w_2$ there remain no more than 7 bits, which requires 3 bits for the value $|w_2|$ and results in
the packing as in Fig. 2(b).

$(f_1, f_2, f_3) = 11x$. In this case, the code $w_1$ satisfies the same restrictions as in the case $(f_1, f_2, f_3) = 100$. The
code $w_2$, whose total length does not exceed 7 bits, must also contain a delimiter with no less than three bits. Thus,
four bits are enough for the value $w_2$, and three bits for $|w_2|$. Since the code $w_1$ occupies at least one bit of the byte $u_i$,
and the shortest code $w_2$ is 110, the length of the encoded and decoded values $w_3$ is not longer than four bits.
Thus, we get the packing shown in Fig. 2(c).
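The three layouts can be illustrated by a small packing and unpacking routine. Since Fig. 2 itself is not reproduced here, the bit offsets in this sketch are an assumption: they are read off from the shifts and masks of the decoding algorithm in Fig. 3, and the function names `pack` and `unpack` are ours.

```python
def pack(flags, w1=(0, 0), w2=(0, 0), w3=(0, 0), r=0):
    """Pack one table row into a 32-bit word.

    flags = (f1, f2, f3); each w is (value, bit_length); r is the index 0..5
    of the remainder. Offsets are assumed from the shifts used in Fig. 3.
    """
    f1, f2, f3 = flags
    t = (f1 << 31) | (f2 << 30) | (f3 << 29)
    if not f1:      # layout (a), flags 000: w1 in 10 bits, |w1| in 4 bits, then r
        return t | w1[0] | (w1[1] << 10) | (r << 14)
    if not f2:      # layout (b), flags 100: w1 6+3 bits, w2 7+3 bits, then r
        return t | w1[0] | (w1[1] << 6) | (w2[0] << 9) | (w2[1] << 16) | (r << 19)
    # layout (c), flags 11x: w1 6+3 bits, w2 4+3 bits, w3 4+3 bits, then r
    return (t | w1[0] | (w1[1] << 6) | (w2[0] << 9) | (w2[1] << 13)
              | (w3[0] << 16) | (w3[1] << 20) | (r << 23))

def unpack(t):
    """Inverse of pack: returns (flags, w1, w2, w3, r)."""
    f1, f2, f3 = (t >> 31) & 1, (t >> 30) & 1, (t >> 29) & 1
    if not f1:
        return (0, 0, 0), (t & 0x3FF, (t >> 10) & 0xF), (0, 0), (0, 0), (t >> 14) & 0x7
    w1 = (t & 0x3F, (t >> 6) & 0x7)
    if not f2:
        return (1, 0, 0), w1, ((t >> 9) & 0x7F, (t >> 16) & 0x7), (0, 0), (t >> 19) & 0x7
    return ((1, 1, f3), w1, ((t >> 9) & 0xF, (t >> 13) & 0x7),
            ((t >> 16) & 0xF, (t >> 20) & 0x7), (t >> 23) & 0x7)

# Last row of Table III: w1 empty, w2 = 0, w3 = 00, empty remainder (index 0 assumed).
row = pack((1, 1, 0), w1=(0, 0), w2=(0b0, 1), w3=(0b00, 2), r=0)
print(hex(row), unpack(row))
```

Storing the bit length next to each value is what allows leading zeros of $w_i$ to be restored when the value is appended to the number being decoded.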

Now we describe in detail the byte aligned decoding algorithm for the code $D_2$ (Fig. 3). By x << c denote
the operation of shifting the value x to the left by c bits, and by x >> c shifting it to the right by c bits (the shift is not cyclic and
new bits are filled with zeros).

The symbol & denotes the bitwise operation "and", and the symbol | stands for the bitwise "or". By $text_i$ we
denote the next byte of the encoded text, and by t a row of Table III packed in a four-byte word. In the variable
w a decoded number is formed as the string concatenation of $w_1$, $w_2$ or $w_3$, and in the variable len the lengths of these
strings are stored. The initial value of w consists of one "1" bit; then it is shifted to the left, and the right bits are replaced
by the values $w_1$, $w_2$ or $w_3$ (from the relevant parts of the word t), and thus the most significant bit of w always
remains 1.

    i ← 1;                                        // byte number of the encoded text
    r ← 0;
    w ← 1;
    while (the end of the text is not reached) {
        t ← TAB[r][text_i];                       // read out the 4-byte row of Table III
        if (t & 0x80000000) {                     // if f1 = 1
            len ← (t >> 6) & 0x7;                 // len ← |w1|
            output (w << len) | (t & 0x3F);       // decoded number: w with w1 (the 6 least
                                                  // significant bits of t) appended to the right
            w ← 1;
            if (t & 0x40000000) {                 // if f2 = 1
                len ← (t >> 13) & 0x7;            // len ← |w2|
                output (w << len) | ((t >> 9) & 0xF);       // decoded number: 1w2
                w ← 1;
                len ← (t >> 20) & 0x7;            // len ← |w3|
                if (t & 0x20000000) {             // if f3 = 1
                    output (w << len) | ((t >> 16) & 0xF);  // decoded number: 1w3
                    w ← 1;
                } else                            // (f1, f2, f3) = 110
                    w ← (w << len) | ((t >> 16) & 0xF);     // w ← 1w3
                r ← (t >> 23) & 0x7;              // r in bits 24-26
            } else {                              // (f1, f2) = 10
                len ← (t >> 16) & 0x7;            // len ← |w2|
                w ← (w << len) | ((t >> 9) & 0x7F);         // w ← 1w2
                r ← (t >> 19) & 0x7;              // r in bits 20-22
            }
        } else {                                  // if f1 = 0
            len ← (t >> 10) & 0xF;                // len ← |w1|
            w ← (w << len) | (t & 0x3FF);         // append w1 to w
            r ← (t >> 14) & 0x7;                  // r in bits 15-17
        }
        i ← i + 1;                                // proceed to the next byte
    }

Fig. 3. Bytewise decoding algorithm for the code $D_2$

TABLE IV

          Bytewise decoding of D2    Bytewise decoding of Fib3
Memory    6K                         21.4K
Time      0.255s                     0.321s


Let us estimate the storage consumption of the method described above. For each of the 6 possible values of $r_{i-1}$ there
exist 256 values of $u_i$, thus Table III contains 6 x 256 rows; 4 bytes are required to store each of them. Thus, the
memory storage of the bytewise decoding method is 6 Kb.

Let us compare the space complexity of the given method with that of the fast byte aligned methods used for decoding
Fibonacci codes. The most detailed study of them is presented in [7], where three such methods are described. The
fastest of them is the method that involves using the table named Fib3. Its memory storage requires 21.4 Kb, i.e.
more than 3.5 times as much as the method we propose.

The time complexities of these methods were compared by numerical experiments. A random 20 million word
fragment from the English Wikipedia text corpus was encoded by the codes $D_2$ and Fib3 and then decoded by the byte
aligned methods mentioned above. The decoding time was measured. The experiment was repeated 100 times, and
the results were averaged. These results are shown in Table IV. As is seen, decoding of $D_2$ is about 20% faster
than that of Fib3. This is mainly due to the fact that the decoding of $D_2$ requires only one memory read operation
at each iteration, after which all the other operations can be performed in processor registers very rapidly, while
the above mentioned Fib3 decoding method requires 2 or 3 readings from one- or two-dimensional arrays at each
iteration.


IX. Compressing data by multi-delimiter codes 

The applicability of a code to information compression is largely related to its density, which is measured by the
number of codewords of length not exceeding $n$. Let us first calculate the asymptotic density of the code $D_2$.
By $f_n$ denote the number of codewords in $D_2$ of length $n$.

Lemma 2. The following equality holds for $n \ge 7$:

\[ f_n = f_{n-1} + f_{n-2} + f_{n-3} + f_{n-6} \tag{8} \]

Proof: Applying formula (4) to the parameters of the code $D_2$ ($t = 1$, $m_1 = 2$) and taking into account that
$\tau_n - \tau_{n-1} = 0$ for $n \ge 6$, we obtain the following recurrent relation that is true for $n \ge 6$:

\[ f_n = f_{n-1} + 2f_{n-2} - f_{n-3} + f_{n-5} \tag{9} \]

By induction, we prove that for $n \ge 7$ equality (8) is equivalent to (9). It is necessary to prove the equality of the
right parts of (8) and (9), which after reductions takes the form $f_{n-2} - f_{n-3} + f_{n-5} = f_{n-3} + f_{n-6}$. This gives the
equality

\[ f_{n-2} + f_{n-5} = 2f_{n-3} + f_{n-6} \tag{10} \]

For $n = 7$ this equality is easy to check directly. Suppose it holds for some $n \ge 7$. Express $f_{n-1}$ by using
formula (9): $f_{n-1} = f_{n-2} + 2f_{n-3} - f_{n-4} + f_{n-6}$. It gives $2f_{n-3} + f_{n-6} = f_{n-1} - f_{n-2} + f_{n-4}$. Substituting
this expression into the right side of (10), we obtain the equality $f_{n-1} + f_{n-4} = 2f_{n-2} + f_{n-5}$, which coincides with
equality (10) if $n$ is replaced by $n + 1$. ■
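Both recurrences can be cross-checked against a direct enumeration of the codewords of $D_2$. The sketch below counts codewords by brute force from the definition of the code (delimiter 0110, shortest word 110) and compares the counts with (8) and (9); the helper name `count_d2` and the length limit 14 are choices made for this example.

```python
from itertools import product

def count_d2(n):
    """Number of D_2 codewords of length n, by direct enumeration."""
    total = 0
    for bits in product("01", repeat=n):
        w = "".join(bits)
        if w == "110":
            total += 1
        elif (not w.startswith("110") and w.endswith("0110")
              and w.find("0110") == n - 4):   # the only delimiter is the suffix
            total += 1
    return total

f = [count_d2(n) for n in range(15)]          # f[0], ..., f[14]
print(f[3:8])                                  # [1, 1, 2, 3, 6]

# Recurrence (9) for n >= 6 and recurrence (8) for n >= 7.
ok9 = all(f[n] == f[n-1] + 2*f[n-2] - f[n-3] + f[n-5] for n in range(6, 15))
ok8 = all(f[n] == f[n-1] + f[n-2] + f[n-3] + f[n-6] for n in range(7, 15))
print(ok9, ok8)
```

Note that (8) fails at $n = 6$, where its right-hand side gives 4 instead of $f_6 = 3$, which is why the lemma is stated for $n \ge 7$.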

By Sn denote the number of codewords, which lengths do not exceed n, Sn = /*■ Taking into account that 

/s = /4 = 1, /s = 2, /g = 3 and, summing over all indices n > 7 both parts of formula (8), we obtain; 

6 n 

Sn = y] /i + X] /* = 

i—3 i—7 

n 

7 + y^(/i-l + /l-2 + fi-3 + fi-e) (11) 

i=7 

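Note that the following identities hold:

$$\begin{aligned}
\sum_{i=7}^{n} f_{i-1} &= \sum_{i=6}^{n-1} f_i = s_{n-1} - 4; \\
\sum_{i=7}^{n} f_{i-2} &= \sum_{i=5}^{n-2} f_i = s_{n-2} - 2; \\
\sum_{i=7}^{n} f_{i-3} &= \sum_{i=4}^{n-3} f_i = s_{n-3} - 1; \\
\sum_{i=7}^{n} f_{i-6} &= \sum_{i=1}^{n-6} f_i = s_{n-6}.
\end{aligned}$$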

Substituting these expressions into formula (11), we obtain:

$$s_n = s_{n-1} + s_{n-2} + s_{n-3} + s_{n-6} \tag{12}$$
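
Recurrence (12) gives a direct way to tabulate the cumulative counts. The sketch below is a minimal illustration; it assumes $s_n = 0$ for $n \le 2$ and the starting values $s_3 = 1$, $s_4 = 2$, $s_5 = 4$ that follow from the values of $f_n$ above. The function name is hypothetical.

```python
def cumulative_counts(n_max):
    """Return {n: s_n} for the code D_2 using recurrence (12).

    s_n is the number of codewords of length at most n.
    """
    # s_n = 0 for n <= 2; s_3, s_4, s_5 follow from f_3 = f_4 = 1, f_5 = 2.
    s = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 4}
    for n in range(6, n_max + 1):
        s[n] = s[n - 1] + s[n - 2] + s[n - 3] + s[n - 6]
    return s


if __name__ == "__main__":
    s = cumulative_counts(15)
    # Same columns as the D_2 row of Table V: n = 2..8 and n = 15.
    print([s[n] for n in (2, 3, 4, 5, 6, 7, 8, 15)])
    # prints [0, 1, 2, 4, 7, 13, 24, 1906]
```

The printed values coincide with the $D_2$ row of Table V.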

Since $s_2 = s_1 = s_0 = s_{-1} = \dots = 0$, $s_3 = 1$, $s_4 = 2$, $s_5 = 4$, $s_6 = 7$, equality (12) holds for $n \ge 6$.
Formula (12) allows us to find the generating function $G(z)$ for $s_n$:

$$G(z) = \sum_{n=0}^{\infty} s_n z^n = z^3 + 2z^4 + 4z^5 + \sum_{n=6}^{\infty} s_n z^n
= z^3 + 2z^4 + 4z^5 + \sum_{n=6}^{\infty} \left(s_{n-1} + s_{n-2} + s_{n-3} + s_{n-6}\right) z^n \tag{13}$$

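Take into account the following equalities:

$$\begin{aligned}
\sum_{n=6}^{\infty} s_{n-1} z^n &= z \sum_{n=6}^{\infty} s_{n-1} z^{n-1} = z \sum_{n=5}^{\infty} s_n z^n = zG(z) - z^4 - 2z^5; \\
\sum_{n=6}^{\infty} s_{n-2} z^n &= z^2 \sum_{n=6}^{\infty} s_{n-2} z^{n-2} = z^2 \sum_{n=4}^{\infty} s_n z^n = z^2 G(z) - z^5; \\
\sum_{n=6}^{\infty} s_{n-3} z^n &= z^3 \sum_{n=6}^{\infty} s_{n-3} z^{n-3} = z^3 \sum_{n=3}^{\infty} s_n z^n = z^3 G(z); \\
\sum_{n=6}^{\infty} s_{n-6} z^n &= z^6 \sum_{n=6}^{\infty} s_{n-6} z^{n-6} = z^6 \sum_{n=0}^{\infty} s_n z^n = z^6 G(z).
\end{aligned}$$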


Substituting these equalities into formula (13) and solving the resulting equation with respect to $G(z)$, we obtain:

$$G(z) = \frac{z^3 + z^4 + z^5}{1 - z - z^2 - z^3 - z^6} = \frac{z^3}{1 - 2z + z^3 - z^4}$$
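
The two forms of $G(z)$ can be cross-checked numerically. The sketch below, which uses hypothetical helper names, multiplies out the common factor $1 + z + z^2$ and expands the reduced fraction $z^3 / (1 - 2z + z^3 - z^4)$ into a power series. If the derivation is right, the coefficients should reproduce the values $s_n$ given by recurrence (12).

```python
def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def series_coefficients(num, den, n_terms):
    """First n_terms Maclaurin coefficients of num(z)/den(z); needs den[0] == 1."""
    coeffs = []
    for n in range(n_terms):
        c = num[n] if n < len(num) else 0
        # From den(z) * series(z) = num(z): c_n = num_n - sum_{k>=1} den_k c_{n-k}
        for k in range(1, min(n, len(den) - 1) + 1):
            c -= den[k] * coeffs[n - k]
        coeffs.append(c)
    return coeffs


REDUCED_DEN = [1, -2, 0, 1, -1]        # 1 - 2z + z^3 - z^4
FULL_DEN = [1, -1, -1, -1, 0, 0, -1]   # 1 - z - z^2 - z^3 - z^6
COMMON_FACTOR = [1, 1, 1]              # 1 + z + z^2

if __name__ == "__main__":
    # The full denominator factors as REDUCED_DEN * COMMON_FACTOR.
    print(poly_mul(REDUCED_DEN, COMMON_FACTOR) == FULL_DEN)
    # Coefficients s_0 .. s_15 from the reduced fraction z^3 / REDUCED_DEN.
    print(series_coefficients([0, 0, 0, 1], REDUCED_DEN, 16))
```

Integer arithmetic is used throughout, so the comparison with the recurrence is exact rather than approximate.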
Decompose $G(z)$ into a sum of partial fractions:

$$G(z) = \frac{-0.3618 + 0.2982i}{z - 0.809 - 0.9816i} + \frac{-0.3618 - 0.2982i}{z - 0.809 + 0.9816i}
- \frac{0.1888}{z + 1.1537} - \frac{0.0876}{z - 0.5357}, \tag{14}$$

where $i$ is the imaginary unit, $i = \sqrt{-1}$.
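
The constants in (14) can be reproduced with a few lines of complex arithmetic. The sketch below is an illustration under the assumption that $G(z) = z^3 / (1 - 2z + z^3 - z^4) = -z^3 / P(z)$ with $P(z) = z^4 - z^3 + 2z - 1$. It refines each rounded pole by Newton's method and evaluates the residue at a simple pole as the numerator divided by the derivative of the denominator. The helper names are hypothetical.

```python
def p(z):
    """P(z) = z^4 - z^3 + 2z - 1, so that G(z) = -z^3 / P(z)."""
    return z**4 - z**3 + 2 * z - 1


def dp(z):
    """Derivative P'(z)."""
    return 4 * z**3 - 3 * z**2 + 2


def refine_root(z, steps=50):
    """Newton iteration for P, started from a rounded pole value."""
    for _ in range(steps):
        z = z - p(z) / dp(z)
    return z


def residue(a):
    """Residue of G at a simple pole a: -a^3 / P'(a)."""
    return -a**3 / dp(a)


# Rounded poles as printed in (14).
POLES = [0.809 + 0.9816j, 0.809 - 0.9816j, -1.1537, 0.5357]

if __name__ == "__main__":
    for start in POLES:
        a = refine_root(start)
        print(a, residue(a))
```

The residues at the two complex poles come out as complex conjugates, as they must for a rational function with real coefficients.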

As seen from (13), $s_n$ equals the coefficient of $z^n$ in the Maclaurin series of the function $G(z)$. If we
expand the function $g(z) = \frac{1}{z - a}$ into a Maclaurin series, then the $n$-th term equals
$\frac{g^{(n)}(0)}{n!} z^n = -\frac{z^n}{a^{n+1}}$.

Thus, the order of growth of $s_n$ is determined by the value $1/a^n$, where $a$ is chosen so that $|a|$ is the
smallest among all fractions of the form $\frac{1}{z - a}$ in formula (14). This is the last fraction
in (14). Thus, $a = 0.5357$ and the order of growth of $s_n$ is given by the expression

$$\left(\frac{1}{0.5357}\right)^{n} \approx 1.867^{n} \tag{15}$$
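
To see how quickly this estimate becomes accurate, one can also keep the constant factor. By the expansion of $1/(z - a)$, the last fraction of (14) contributes $0.0876 / 0.5357^{\,n+1}$ to the coefficient of $z^n$. The sketch below compares this single term with the exact values from recurrence (12). It uses the rounded constants printed in (14), so only approximate agreement should be expected; the function names are hypothetical.

```python
def exact_s(n_max):
    """Exact s_n from recurrence (12), with s_n = 0 for n <= 2."""
    s = [0, 0, 0, 1, 2, 4]
    for n in range(6, n_max + 1):
        s.append(s[n - 1] + s[n - 2] + s[n - 3] + s[n - 6])
    return s


def dominant_term(n, a=0.5357, c=0.0876):
    """Contribution of the pole z = a to the coefficient of z^n.

    -c / (z - a) expands to the series sum_n (c / a^(n+1)) z^n;
    a and c are the rounded values from (14).
    """
    return c / a ** (n + 1)


if __name__ == "__main__":
    s = exact_s(30)
    for n in (8, 15, 30):
        print(n, s[n], round(dominant_term(n), 1))
```

The remaining three poles have absolute values greater than 1, so their contributions decay as $n$ grows and the single dominant term suffices for the asymptotics.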
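TABLE V

The number of codewords of length $\le n$ for some codes

| Code | Asymptotic | n = 2 | n = 3 | n = 4 | n = 5 | n = 6 | n = 7 | n = 8 | n = 15 |
|---|---|---|---|---|---|---|---|---|---|
| The codes with the shortest codeword of the length 2 | | | | | | | | | |
| Fib2 | $1.618^n$ | 1 | 2 | 4 | 7 | 12 | 20 | 33 | 986 |
| $D_1$ | $1.755^n$ | 1 | 2 | 3 | 5 | 9 | 16 | 28 | 1432 |
| $D_{1,2}$ | $1.618^n$ | 1 | 3 | 5 | 7 | 10 | 16 | 27 | 799 |
| $D_{1,3}$ | $1.674^n$ | 1 | 2 | 4 | 7 | 11 | 18 | 30 | 1106 |
| The codes with the shortest codeword of the length 3 | | | | | | | | | |
| Fib3 | $1.839^n$ | 0 | 1 | 2 | 4 | 8 | 15 | 28 | 2031 |
| $D_2$ | $1.867^n$ | 0 | 1 | 2 | 4 | 7 | 13 | 24 | 1906 |
| $D_{2,3}$ | $1.785^n$ | 0 | 1 | 3 | 6 | 11 | 19 | 33 | 1874 |
| $D_{2,4}$ | $1.823^n$ | 0 | 1 | 2 | 5 | 9 | 17 | 30 | 1998 |
| $D_{2,5}$ | $1.844^n$ | 0 | 1 | 2 | 4 | 8 | 15 | 28 | 1999 |
| $D_{2,3,4}$ | $1.731^n$ | 0 | 1 | 3 | 7 | 13 | 23 | 39 | 1721 |
| $D_{2,3,5}$ | $1.755^n$ | 0 | 1 | 3 | 6 | 12 | 21 | 37 | 1833 |
| $D_{2,4,5}$ | $1.796^n$ | 0 | 1 | 2 | 5 | 10 | 19 | 34 | 2019 |
| $D_{2,4,6}$ | $1.809^n$ | 0 | 1 | 2 | 5 | 9 | 18 | 32 | 2032 |
| The codes with the shortest codeword of the length 4 | | | | | | | | | |
| Fib4 | $1.928^n$ | 0 | 0 | 1 | 2 | 4 | 8 | 16 | 1606 |
| $D_3$ | $1.933^n$ | 0 | 0 | 1 | 2 | 4 | 8 | 15 | 1510 |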


As shown in [7], among the family of Fibonacci codes of higher orders, the code Fib3 gives the best compression
rate when encoding natural language texts. The asymptotic density of this code is $1.839^n$. Thus, the code
$D_2$ is asymptotically denser than Fib3. This is also evident from the simple fact that the number of words of
length $n$ in the code $D_2$ is determined by formula (8): $f_n = f_{n-1} + f_{n-2} + f_{n-3} + f_{n-6}$, while for the
code Fib3 it is $f_n = f_{n-1} + f_{n-2} + f_{n-3}$.
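
The comparison of the two recurrences can be made explicit through their characteristic polynomials. The following short derivation is a sketch, assuming that the growth rate of each sequence is the positive root of its characteristic polynomial; the symbols $p$, $q$ and $\tau$ are introduced here for illustration only.

```latex
\begin{align*}
q(x) &= x^3 - x^2 - x - 1
  && \text{(Fib3: } f_n = f_{n-1} + f_{n-2} + f_{n-3}\text{)} \\
p(x) &= x^6 - x^5 - x^4 - x^3 - 1 = x^3 q(x) - 1
  && \text{($D_2$: formula (8))} \\
p(\tau) &= \tau^3 q(\tau) - 1 = -1 < 0,
  \qquad p(2) = 64 - 32 - 16 - 8 - 1 = 7 > 0
\end{align*}
```

Here $\tau \approx 1.839$ is the positive root of $q$. Since $p$ changes sign on $(\tau, 2)$ and its coefficients have exactly one sign change, $p$ has exactly one positive root, and that root exceeds $\tau$. This agrees with the value 1.867 obtained for $D_2$.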

Using the standard technique of generating functions, it is not difficult to calculate the asymptotic density of 
other multi-delimiter codes. For several such codes that may be of interest from the practical point of view, as well 
as for several Fibonacci codes, these values together with numbers of short codewords are given in Table V. 

As seen, many multi-delimiter codes contain more short codewords than the corresponding Fibonacci
codes with the same length of the shortest codeword. The "champions" are the codes $D_{2,3}$, $D_{2,3,4}$, $D_{2,3,5}$, and
$D_{2,4,5}$. They are the candidates for efficient compression. However, the code $D_{2,3,4}$ has a quite low asymptotic
density, which narrows its application area to small alphabets only. We investigate more thoroughly the other three
codes together with the code $D_2$, which has the highest asymptotic density in the class of codes with the shortest
codeword of length 3.

Compression efficiency of multi-delimiter codes was measured experimentally on different sources of English
text. Namely, we took the Bible (King James Version), three other famous pieces of writing, and the full content
of English Wikipedia. The results are presented in Table VI in terms of the average codeword length. We compared
the performance of multi-delimiter codes and the Fibonacci code Fib3, which is taken as the baseline for comparison.
This code is known as the most efficient for natural language text compression among all Fibonacci codes.


DRAFT


August 7, 2015


TABLE VI

Empirical comparison of compression rate (the average codeword length) of Fib3 and some multi-delimiter codes

| Source | Alphabet size | Fib3 | $D_2$ | $D_{2,3}$ | $D_{2,3,5}$ | $D_{2,4,5}$ |
|---|---|---|---|---|---|---|
| Bible KJV | 12,452 | 9.21 | 9.35 (+1.6%) | 9.03 (-2%) | 8.95 (-2.8%) | 9.04 (-1.8%) |
| Hamlet, Shakespeare | 4,501 | 10.0 | 10.16 (+1.6%) | 9.82 (-1.8%) | 9.74 (-2.5%) | 9.81 (-1.9%) |
| Robinson Crusoe, D. Defoe | 5,994 | 9.4 | 9.55 (+1.6%) | 9.19 (-2.2%) | 9.12 (-3%) | 9.21 (-2%) |
| Oliver Twist, C. Dickens | 10,027 | 10.06 | 10.21 (+1.5%) | 9.91 (-1.6%) | 9.84 (-2.3%) | 9.89 (-1.7%) |
| English Wikipedia | 5,487,696 | 11.585 | 11.696 (+1%) | 11.521 (-0.6%) | 11.517 (-0.6%) | 11.497 (-0.8%) |

As seen, the codes with 2 and 3 delimiters outperform the Fib3 code. For example, the average codeword length
for the code $D_{2,3,5}$ is about 2-3% less than that for the code Fib3 if the alphabet size is around 10K words. This
is a significant difference if we take into account that the code Fib3 exceeds the entropy bound by only 5-6%
for English texts, as reported in [7]. Since the asymptotic density of these multi-delimiter codes is lower, their
advantage over Fib3 decreases as the alphabet size grows. However, codes with 2 and 3 delimiters are still superior
even for Wikipedia, which is one of the largest natural language text corpora known to date, containing over 5 million
different words.

In comparison with the multi-delimiter codes, the code Fib3 also has a drawback concerning instantaneous
separation, a property that is important for searching for a word in a compressed file without decompressing it.
Fib3, the multi-delimiter codes, and other codes used for text compression share the following feature:
if a certain bit sequence $w$ occurs in a compressed file, we cannot guarantee that it corresponds to an
occurrence of the whole codeword $w$, since it could be a suffix of another codeword. In multi-delimiter codes,
to check whether $w$ is truly a separate codeword it is enough to consider a fixed number of bits that precede $w$. For
example, it is enough to check 4 bits for the code $D_2$. If they turn out to be 0110, then $w$ is a codeword; otherwise
it is not. However, checking any fixed number of bits preceding a codeword is not enough in the code Fib3,
since the sequence 111 is both the delimiter and the shortest codeword in this code. Several such codewords can
"stick together" if they are adjacent. As one way to avoid this problem, it is proposed in [7] to exclude the shortest
codeword 111 from the code Fib3. However, the density and compression efficiency of the code obtained in this way
are significantly worse than those of all the codes discussed above, including $D_2$.
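
The 4-bit check for $D_2$ can be illustrated with a small brute-force experiment. The sketch below rests on an assumption about the code's definition, restated here from the general definition of multi-delimiter codes: $D_2$ contains the word 110 and every other word that does not begin with 110 and contains the delimiter 0110 only as a suffix. Equivalently, after prepending a virtual 0 to a word, 0110 occurs exactly once, at the end. All function names are hypothetical.

```python
import itertools
import random

DELIM = "0110"  # the single delimiter of D_2


def is_codeword(w):
    """Assumed membership test for D_2 (see the assumption stated above).

    With a virtual leading 0, the delimiter must occur exactly once,
    as a suffix. This accepts 110 and rejects longer words starting with 110.
    """
    t = "0" + w
    return t.endswith(DELIM) and t.find(DELIM) == len(t) - len(DELIM)


def codewords_up_to(max_len):
    """All D_2 codewords of length <= max_len, shortest first."""
    words = []
    for n in range(1, max_len + 1):
        for bits in itertools.product("01", repeat=n):
            w = "".join(bits)
            if is_codeword(w):
                words.append(w)
    return words


def boundaries_from_delimiters(stream):
    """Positions i such that the 4 bits ending at position i are 0110.

    The stream is scanned with a virtual leading 0, so t[i - 3:i + 1]
    covers the four bits that precede stream position i.
    """
    t = "0" + stream
    return [i for i in range(3, len(stream) + 1) if t[i - 3:i + 1] == DELIM]


if __name__ == "__main__":
    words = codewords_up_to(8)
    print(len(words))  # number of codewords of length <= 8

    rng = random.Random(0)
    chosen = [rng.choice(words) for _ in range(1000)]
    stream = "".join(chosen)
    true_boundaries = list(itertools.accumulate(len(w) for w in chosen))
    # Every codeword boundary, and nothing else, is preceded by 0110.
    print(boundaries_from_delimiters(stream) == true_boundaries)
```

Under this assumed definition the enumeration yields 24 codewords of length at most 8, in agreement with the $D_2$ row of Table V, and the positions preceded by 0110 coincide exactly with the codeword boundaries.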

X. Conclusion 

In this paper we introduce a new family of splittable codes that are based on encoding sequences of ordered 
integer pairs. Splittable codes form a rich set of codes that include the (2,3)-codes, the Fibonacci codes of higher 
orders and the multi-delimiter codes. 

The multi-delimiter codes are of special interest. They possess all the properties known for the Fibonacci codes,
such as completeness, universality, simple vocabulary representation, and strong robustness. They also have
several additional advantages:

(i) Adaptability. By varying the delimiters, we can adapt a multi-delimiter code to a given source probability
distribution and alphabet size.

(ii) Better compression rate for natural language texts.

(iii) Good computational performance, with low time and storage overheads.

(iv) Instantaneous separation of codewords, which allows faster compressed search.

The set of multi-delimiter codes together with the set of Fibonacci codes can be useful in many practical 
applications. 